Re: [PATCH v2] powerpc/32: Don't use a struct based type for pte_t

2021-09-17 Thread Michael Ellerman
Christophe Leroy  writes:
> Long time ago we had a config item called STRICT_MM_TYPECHECKS
> to build the kernel with pte_t defined as a structure in order
> to perform additional build checks or build it with pte_t
> defined as a simple type in order to get simpler generated code.
>
> Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
> made the struct based definition the only one, considering that the
> generated code was similar in both cases.
>
> That's right on ppc64 because the ABI is such that the content of a
> struct having a single simple type element is passed as register,
> but on ppc32 such a structure is passed via the stack like any
> structure.
>
> Simple test function:
>
>   pte_t test(pte_t pte)
>   {
>   return pte;
>   }
>
> Before this patch we get
>
>   c00108ec <test>:
>   c00108ec:   81 24 00 00 lwz r9,0(r4)
>   c00108f0:   91 23 00 00 stw r9,0(r3)
>   c00108f4:   4e 80 00 20 blr
>
> So, for PPC32, restore the simple type behaviour we got before
> commit 670eea924198, but instead of adding a config option to
> activate type check, do it when __CHECKER__ is set so that type
> checking is performed by 'sparse' and provides feedback like:
>
>   arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types)
>   arch/powerpc/mm/pgtable.c:466:16:    expected unsigned long
>   arch/powerpc/mm/pgtable.c:466:16:    got struct pte_t [usertype] x

OK that's a good trade off.

One question below ...

> diff --git a/arch/powerpc/include/asm/pgtable-types.h 
> b/arch/powerpc/include/asm/pgtable-types.h
> index d11b4c61d686..c60199fc6fa6 100644
> --- a/arch/powerpc/include/asm/pgtable-types.h
> +++ b/arch/powerpc/include/asm/pgtable-types.h
> @@ -5,14 +5,26 @@
>  /* PTE level */
>  #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
>  typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
> -#else
> +#elif defined(__CHECKER__) || !defined(CONFIG_PPC32)

It would be nicer if this logic was in Kconfig.

eg. restore config STRICT_MM_TYPECHECKS but make it always enabled for
64-bit, and depend on CHECKER for 32-bit.

The only thing is I'm not sure if we can test __CHECKER__ in Kconfig?
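
For what it's worth, __CHECKER__ is a macro that sparse predefines while it parses each file, so Kconfig itself cannot see it; a restored STRICT_MM_TYPECHECKS symbol could only gate the 64-bit side, with the header still testing __CHECKER__ directly. A rough sketch of what the header arbitration might then look like (hypothetical, not taken from the patch; the STRICT_MM_TYPECHECKS symbol and the single-element struct form are assumptions):

    #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
    typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
    #elif defined(CONFIG_STRICT_MM_TYPECHECKS) || defined(__CHECKER__)
    typedef struct { pte_basic_t pte; } pte_t;	/* struct form: build-time type checking */
    #else
    typedef pte_basic_t pte_t;			/* simple type: better PPC32 code generation */
    #endif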

cheers


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-17 Thread Peter Zijlstra
On Fri, Sep 17, 2021 at 05:25:02PM +0200, Vincent Guittot wrote:

> With the removal of the condition !sds->local_stat.sum_nr_running
> which seems useless because dst_cpu is idle and not SMT, this patch
> looks good to me

I've made it look like this. Thanks!

---
Subject: sched/fair: Consider SMT in ASYM_PACKING load balance
From: Ricardo Neri 
Date: Fri, 10 Sep 2021 18:18:19 -0700

From: Ricardo Neri 

When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.

If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.

Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
siblings of both dst_cpu and the CPUs in the candidate busiest group.

Signed-off-by: Ricardo Neri 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Joel Fernandes (Google) 
Reviewed-by: Len Brown 
Link: 
https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calde...@linux.intel.com
---
 kernel/sched/fair.c |   92 
 1 file changed, 92 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8538,10 +8538,96 @@ group_type group_classify(unsigned int i
return group_has_spare;
 }
 
+/**
+ * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:   Destination CPU of the load balancing
+ * @sds:   Load-balancing data with statistics of the local group
+ * @sgs:   Load-balancing statistics of the candidate busiest group
+ * @sg:    The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks.
+ *
+ * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
+ * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
+ * only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
+ * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
+ * Bigger imbalances in the number of busy CPUs will be dealt with in
+ * update_sd_pick_busiest().
+ *
+ * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
+ * of @dst_cpu are idle and @sg has lower priority.
+ */
+static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+   struct sg_lb_stats *sgs,
+   struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   bool local_is_smt, sg_is_smt;
+   int sg_busy_cpus;
+
+   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
+   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
+
+   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+   if (!local_is_smt) {
+   /*
+* If we are here, @dst_cpu is idle and does not have SMT
+* siblings. Pull tasks if candidate group has two or more
+* busy CPUs.
+*/
+   if (sg_busy_cpus >= 2) /* implies sg_is_smt */
+   return true;
+
+   /*
+* @dst_cpu does not have SMT siblings. @sg may have SMT
+* siblings and only one is busy. In such case, @dst_cpu
+* can help if it has higher priority and is idle (i.e.,
+* it has no running tasks).
+*/
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+   }
+
+   /* @dst_cpu has SMT siblings. */
+
+   if (sg_is_smt) {
+   int local_busy_cpus = sds->local->group_weight -
+ sds->local_stat.idle_cpus;
+   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
+
+   if (busy_cpus_delta == 1)
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+   return false;
+   }
+
+   /*
+* @sg does not have SMT siblings. Ensure that @sds::local does not end
+* up with more than one busy SMT sibling and only pull tasks if there
+* are not busy CPUs (i.e., no CPU has running tasks).
+*/
+   if (!sds->local_stat.sum_nr_running)
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+   return false;
+#else
+   /* Always return false so that callers deal with non-SMT cases. */
+   return false;
+#endif
+}
+
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
	   struct sched_group *group)
 {
+   /* Only do SMT checks if either local or candidate have SMT siblings */
+   if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+   (group->flags & SD_SHARE_CPUCAPACITY))
+   return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+

[PATCH 2/3] powerpc/inst: Define ppc_inst_t

2021-09-17 Thread Christophe Leroy
In order to stop using 'struct ppc_inst' on PPC32,
define a ppc_inst_t typedef.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/code-patching.h  | 18 +++
 arch/powerpc/include/asm/hw_breakpoint.h  |  4 +-
 arch/powerpc/include/asm/inst.h   | 34 ++--
 arch/powerpc/include/asm/sstep.h  |  4 +-
 arch/powerpc/kernel/align.c   |  4 +-
 arch/powerpc/kernel/epapr_paravirt.c  |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c   |  4 +-
 .../kernel/hw_breakpoint_constraints.c|  4 +-
 arch/powerpc/kernel/kprobes.c |  4 +-
 arch/powerpc/kernel/mce_power.c   |  2 +-
 arch/powerpc/kernel/optprobes.c   |  4 +-
 arch/powerpc/kernel/process.c |  2 +-
 arch/powerpc/kernel/setup_32.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c| 54 +--
 arch/powerpc/kernel/vecemu.c  |  2 +-
 arch/powerpc/lib/code-patching.c  | 38 ++---
 arch/powerpc/lib/feature-fixups.c |  4 +-
 arch/powerpc/lib/sstep.c  |  4 +-
 arch/powerpc/lib/test_emulate_step.c  | 10 ++--
 arch/powerpc/mm/maccess.c |  2 +-
 arch/powerpc/perf/8xx-pmu.c   |  2 +-
 arch/powerpc/xmon/xmon.c  | 14 ++---
 arch/powerpc/xmon/xmon_bpts.h |  4 +-
 23 files changed, 112 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index a95f63788c6b..a4ba9a5982fc 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -23,20 +23,20 @@
 #define BRANCH_ABSOLUTE0x2
 
 bool is_offset_in_branch_range(long offset);
-int create_branch(struct ppc_inst *instr, const u32 *addr,
+int create_branch(ppc_inst_t *instr, const u32 *addr,
  unsigned long target, int flags);
-int create_cond_branch(struct ppc_inst *instr, const u32 *addr,
+int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
   unsigned long target, int flags);
 int patch_branch(u32 *addr, unsigned long target, int flags);
-int patch_instruction(u32 *addr, struct ppc_inst instr);
-int raw_patch_instruction(u32 *addr, struct ppc_inst instr);
+int patch_instruction(u32 *addr, ppc_inst_t instr);
+int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
 
 static inline unsigned long patch_site_addr(s32 *site)
 {
return (unsigned long)site + *site;
 }
 
-static inline int patch_instruction_site(s32 *site, struct ppc_inst instr)
+static inline int patch_instruction_site(s32 *site, ppc_inst_t instr)
 {
return patch_instruction((u32 *)patch_site_addr(site), instr);
 }
@@ -57,11 +57,11 @@ static inline int modify_instruction_site(s32 *site, 
unsigned int clr, unsigned
return modify_instruction((unsigned int *)patch_site_addr(site), clr, 
set);
 }
 
-int instr_is_relative_branch(struct ppc_inst instr);
-int instr_is_relative_link_branch(struct ppc_inst instr);
+int instr_is_relative_branch(ppc_inst_t instr);
+int instr_is_relative_link_branch(ppc_inst_t instr);
 unsigned long branch_target(const u32 *instr);
-int translate_branch(struct ppc_inst *instr, const u32 *dest, const u32 *src);
-extern bool is_conditional_branch(struct ppc_inst instr);
+int translate_branch(ppc_inst_t *instr, const u32 *dest, const u32 *src);
+extern bool is_conditional_branch(ppc_inst_t instr);
 #ifdef CONFIG_PPC_BOOK3E_64
 void __patch_exception(int exc, unsigned long addr);
 #define patch_exception(exc, name) do { \
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h 
b/arch/powerpc/include/asm/hw_breakpoint.h
index abebfbee5b1c..88053d3c68e6 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -56,11 +56,11 @@ static inline int nr_wp_slots(void)
return cpu_has_feature(CPU_FTR_DAWR1) ? 2 : 1;
 }
 
-bool wp_check_constraints(struct pt_regs *regs, struct ppc_inst instr,
+bool wp_check_constraints(struct pt_regs *regs, ppc_inst_t instr,
  unsigned long ea, int type, int size,
  struct arch_hw_breakpoint *info);
 
-void wp_get_instr_detail(struct pt_regs *regs, struct ppc_inst *instr,
+void wp_get_instr_detail(struct pt_regs *regs, ppc_inst_t *instr,
 int *type, int *size, unsigned long *ea);
 
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index fea4d46155a9..c4775f4c93a4 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -8,7 +8,7 @@
 ({ \
long __gui_ret; \
u32 __user *__gui_ptr = (u32 __user *)ptr;  \
-   struct ppc_inst __gui_inst;

[PATCH 1/3] powerpc/inst: Refactor ___get_user_instr()

2021-09-17 Thread Christophe Leroy
PPC64 version of ___get_user_instr() can be used for PPC32 as well,
by simply disabling the suffix part with IS_ENABLED(CONFIG_PPC64).

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index b11c0e2f9639..fea4d46155a9 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -4,8 +4,6 @@
 
 #include 
 
-#ifdef CONFIG_PPC64
-
 #define ___get_user_instr(gu_op, dest, ptr)\
 ({ \
long __gui_ret; \
@@ -16,7 +14,7 @@
__chk_user_ptr(ptr);\
__gui_ret = gu_op(__prefix, __gui_ptr); \
if (__gui_ret == 0) {   \
-   if ((__prefix >> 26) == OP_PREFIX) {\
+   if (IS_ENABLED(CONFIG_PPC64) && (__prefix >> 26) == OP_PREFIX) 
{ \
__gui_ret = gu_op(__suffix, __gui_ptr + 1); \
__gui_inst = ppc_inst_prefix(__prefix, __suffix); \
} else {\
@@ -27,13 +25,6 @@
}   \
__gui_ret;  \
 })
-#else /* !CONFIG_PPC64 */
-#define ___get_user_instr(gu_op, dest, ptr)\
-({ \
-   __chk_user_ptr(ptr);\
-   gu_op((dest).val, (u32 __user *)(ptr)); \
-})
-#endif /* CONFIG_PPC64 */
 
 #define get_user_instr(x, ptr) ___get_user_instr(get_user, x, ptr)
 
-- 
2.31.1



[PATCH 3/3] powerpc/inst: Define ppc_inst_t as u32 on PPC32

2021-09-17 Thread Christophe Leroy
Unlike PPC64 ABI, PPC32 uses the stack to pass a parameter defined
as a struct, even when the struct has a single simple element.

To avoid that, define ppc_inst_t as u32 on PPC32.
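
For illustration only (this is not part of the patch, just the same experiment as the pte_t discussion earlier today), a trivial identity function shows the effect; with ppc_inst_t as a bare u32, it no longer needs to go through memory on PPC32:

    /* illustrative only */
    ppc_inst_t test(ppc_inst_t inst)
    {
    	return inst;
    }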

Keep it as 'struct ppc_inst' when __CHECKER__ is defined so that
sparse can perform type checking.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index c4775f4c93a4..2f8d189e4d76 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -41,6 +41,7 @@ struct ppc_inst {
 #endif
 } __packed;
 
+#if defined(CONFIG_PPC64) || defined(__CHECKER__)
 typedef struct ppc_inst ppc_inst_t;
 
 static inline u32 ppc_inst_val(ppc_inst_t x)
@@ -48,13 +49,23 @@ static inline u32 ppc_inst_val(ppc_inst_t x)
return x.val;
 }
 
+#define ppc_inst(x) ((ppc_inst_t){ .val = (x) })
+
+#else
+typedef u32 ppc_inst_t;
+
+static inline u32 ppc_inst_val(ppc_inst_t x)
+{
+   return x;
+}
+#define ppc_inst(x) (x)
+#endif
+
 static inline int ppc_inst_primary_opcode(ppc_inst_t x)
 {
return ppc_inst_val(x) >> 26;
 }
 
-#define ppc_inst(x) ((ppc_inst_t){ .val = (x) })
-
 #ifdef CONFIG_PPC64
 #define ppc_inst_prefix(x, y) ((ppc_inst_t){ .val = (x), .suffix = (y) })
 
-- 
2.31.1



[powerpc:fixes-test] BUILD SUCCESS c006a06508db4841d256d82f42da392d6391f3d9

2021-09-17 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: c006a06508db4841d256d82f42da392d6391f3d9  powerpc/xics: Set the 
IRQ chip data for the ICS native backend

elapsed time: 1508m

configs tested: 138
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-20210916
sh  rsk7269_defconfig
powerpc ep8248e_defconfig
riscvnommu_virt_defconfig
powerpc  chrp32_defconfig
arm bcm2835_defconfig
arm at91_dt_defconfig
s390 allyesconfig
armhisi_defconfig
sh   se7780_defconfig
armqcom_defconfig
nios2 10m50_defconfig
arc nsimosci_hs_smp_defconfig
mipsar7_defconfig
arm   netwinder_defconfig
sh  landisk_defconfig
um   x86_64_defconfig
armmvebu_v7_defconfig
arcnsimosci_defconfig
mips loongson1b_defconfig
sh   se7712_defconfig
powerpcwarp_defconfig
arm  integrator_defconfig
powerpc mpc837x_rdb_defconfig
powerpc tqm8541_defconfig
powerpcklondike_defconfig
powerpc tqm8548_defconfig
powerpc kmeter1_defconfig
powerpc  mpc885_ads_defconfig
arm cm_x300_defconfig
sh   alldefconfig
m68k amcore_defconfig
armoxnas_v6_defconfig
ia64  tiger_defconfig
powerpc  katmai_defconfig
armmini2440_defconfig
armtrizeps4_defconfig
xtensa  cadence_csp_defconfig
arm  alldefconfig
powerpc powernv_defconfig
mips  rm200_defconfig
mips cu1000-neo_defconfig
arm  ep93xx_defconfig
xtensa  audio_kc705_defconfig
xtensa  iss_defconfig
sh  defconfig
x86_64   allyesconfig
powerpc  storcenter_defconfig
armkeystone_defconfig
arm ezx_defconfig
mips   ip27_defconfig
arm   imx_v6_v7_defconfig
arm am200epdkit_defconfig
m68kmac_defconfig
mipsnlm_xlr_defconfig
x86_64   randconfig-c001-20210916
arm  randconfig-c002-20210916
ia64 allmodconfig
ia64 allyesconfig
ia64defconfig
m68k allmodconfig
m68k allyesconfig
m68kdefconfig
nios2   defconfig
nds32 allnoconfig
arc  allyesconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
xtensa   allyesconfig
parisc   allyesconfig
parisc  defconfig
s390 allmodconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
i386 allyesconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc   allnoconfig
powerpc  allmodconfig
x86_64   randconfig-a016-20210916
x86_64   randconfig-a013-20210916
x86_64   randconfig-a012-20210916
x86_64   randconfig-a011-20210916
x86_64   randconfig-a014-20210916
x86_64   randconfig-a015-20210916
i386 randconfig-a016-20210916
i386   

Re: [PATCH v2 8/8] bpf ppc32: Add addr > TASK_SIZE_MAX explicit check

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

With KUAP enabled, any kernel code which wants to access userspace
needs to be surrounded by disable-enable KUAP. But that is not
happening for BPF_PROBE_MEM load instruction. Though PPC32 does not
support read protection, considering the fact that PTR_TO_BTF_ID
(which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer
or NULL but should never be a pointer to userspace address, execute
BPF_PROBE_MEM load only if addr > TASK_SIZE_MAX, otherwise set
dst_reg=0 and move on.


Same comment as patch 6.



This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.

[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to handle bad userspace pointers on PPC32.


  arch/powerpc/net/bpf_jit_comp32.c | 39 +++
  1 file changed, 39 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index c6262289dcc4..faa8047fcf4a 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -821,6 +821,45 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/* dst = *(u64 *)(ul) (src + off) */
case BPF_LDX | BPF_MEM | BPF_DW:
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+   /*
+* As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could 
either be a valid
+* kernel pointer or NULL but not a userspace address, 
execute BPF_PROBE_MEM
+* load only if addr > TASK_SIZE_MAX, otherwise set 
dst_reg=0 and move on.
+*/
+   if (BPF_MODE(code) == BPF_PROBE_MEM) {
+   bool extra_insn_needed = false;
+   unsigned int adjusted_idx;
+
+   /*
+* For BPF_DW case, "li reg_h,0" would be 
needed when
+* !fp->aux->verifier_zext. Adjust conditional 
branch'ing
+* address accordingly.
+*/
+   if ((size == BPF_DW) && !fp->aux->verifier_zext)
+   extra_insn_needed = true;


Don't make it too complicated. That's a fallback that should never 
happen, no need to optimise. You can put that instruction all the time 
(or put a NOP) and keep the jumps always the same.
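
For illustration, that simplification could look roughly like this (a sketch only, assuming PPC_BCC expands to two instructions as in bpf_jit.h, and using the if/else form suggested further below for the jump; all names are from the patch):

	if (BPF_MODE(code) == BPF_PROBE_MEM) {
		EMIT(PPC_RAW_ADDI(b2p[TMP_REG], src_reg, off));
		PPC_LI32(_R0, TASK_SIZE_MAX);
		EMIT(PPC_RAW_CMPLW(b2p[TMP_REG], _R0));
		/* kernel address: skip the three "clear and bail out" instructions */
		PPC_BCC(COND_GT, (ctx->idx + 5) * 4);
		EMIT(PPC_RAW_LI(dst_reg, 0));
		/* always clear the high word; harmless when the verifier zero-extends */
		EMIT(PPC_RAW_LI(dst_reg_h, 0));
		/* skip the load: one instruction for B/H/W, two for DW */
		if (size == BPF_DW)
			PPC_JMP((ctx->idx + 3) * 4);
		else
			PPC_JMP((ctx->idx + 2) * 4);
	}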



+
+   /*
+* Need to jump two instructions instead of one 
for BPF_DW case
+* as there are two load instructions for dst_reg_h 
& dst_reg
+* respectively.
+*/
+   adjusted_idx = (size == BPF_DW) ? 1 : 0;


Same comment as patch 6, drop adjusted_idx and do an if/else directly 
for the PPC_JMP.



+
+   EMIT(PPC_RAW_ADDI(b2p[TMP_REG], src_reg, off));
+   PPC_LI32(_R0, TASK_SIZE_MAX);
+   EMIT(PPC_RAW_CMPLW(b2p[TMP_REG], _R0));
+   PPC_BCC(COND_GT, (ctx->idx + 4 + (extra_insn_needed ? 1 : 0)) * 4);
+   EMIT(PPC_RAW_LI(dst_reg, 0));
+   /*
+* Note that "li reg_h,0" is emitted for BPF_B/H/W case,
+* if necessary. So, jump there instead of emitting an
+* additional "li reg_h,0" instruction.
+*/
+   if (extra_insn_needed)
+   EMIT(PPC_RAW_LI(dst_reg_h, 0));
+   PPC_JMP((ctx->idx + 2 + adjusted_idx) * 4);
+   }
+
switch (size) {
case BPF_B:
EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));



Re: [PATCH v2 6/8] bpf ppc64: Add addr > TASK_SIZE_MAX explicit check

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

From: Ravi Bangoria 

On PPC64 with KUAP enabled, any kernel code which wants to
access userspace needs to be surrounded by disable-enable KUAP.
But that is not happening for BPF_PROBE_MEM load instruction.
So, when BPF program tries to access invalid userspace address,
page-fault handler considers it as bad KUAP fault:

   Kernel attempted to read user page (d000) - exploit attempt? (uid: 0)

Considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM
mode) could either be a valid kernel pointer or NULL but should
never be a pointer to userspace address, execute BPF_PROBE_MEM load
only if addr > TASK_SIZE_MAX, otherwise set dst_reg=0 and move on.


You should do like copy_from_kernel_nofault_allowed() and use the same 
criteria as is_kernel_addr() instead of using TASK_SIZE_MAX.
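
Roughly, that direction would be something like the sketch below (illustrative only; the PAGE_OFFSET bound is an assumption standing in for whatever criterion is_kernel_addr() actually uses, so check asm/page.h before relying on it):

	EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], src_reg, off));
	PPC_LI64(b2p[TMP_REG_2], PAGE_OFFSET);
	EMIT(PPC_RAW_CMPLD(b2p[TMP_REG_1], b2p[TMP_REG_2]));
	/* only a kernel address reaches the real load below */
	PPC_BCC(COND_GE, (ctx->idx + 4) * 4);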




This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.

[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov 
Signed-off-by: Ravi Bangoria 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Refactored the code based on Christophe's comments.


  arch/powerpc/net/bpf_jit_comp64.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 2fc10995f243..eb28dbc67151 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -769,6 +769,29 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/* dst = *(u64 *)(ul) (src + off) */
case BPF_LDX | BPF_MEM | BPF_DW:
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+   /*
+* As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could 
either be a valid
+* kernel pointer or NULL but not a userspace address, 
execute BPF_PROBE_MEM
+* load only if addr > TASK_SIZE_MAX, otherwise set 
dst_reg=0 and move on.
+*/
+   if (BPF_MODE(code) == BPF_PROBE_MEM) {
+   unsigned int adjusted_idx;
+
+   /*
+* Check if 'off' is word aligned because PPC_BPF_LL()
+* (BPF_DW case) generates two instructions if 'off' is not
+* word-aligned and one instruction otherwise.
+*/
+   adjusted_idx = ((BPF_SIZE(code) == BPF_DW) && (off & 3)) ? 1 : 0;


No need of ( ) around 'BPF_SIZE(code) == BPF_DW'


+
+   EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], src_reg, off));
+   PPC_LI64(b2p[TMP_REG_2], TASK_SIZE_MAX);
+   EMIT(PPC_RAW_CMPLD(b2p[TMP_REG_1], b2p[TMP_REG_2]));
+   PPC_BCC(COND_GT, (ctx->idx + 4) * 4);
+   EMIT(PPC_RAW_LI(dst_reg, 0));
+   PPC_JMP((ctx->idx + 2 + adjusted_idx) * 4);


I think it would be more explicit if you drop adjusted_idx and do :

if (BPF_SIZE(code) == BPF_DW) && (off & 3)
PPC_JMP((ctx->idx + 3) * 4);
else
PPC_JMP((ctx->idx + 2) * 4);


+   }
+
switch (size) {
case BPF_B:
EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));



Re: [PATCH v2 7/8] bpf ppc32: Add BPF_PROBE_MEM support for JIT

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.

Unlike other archs which use extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains 3 instructions,
first 2 instructions clear dest_reg (lower & higher 32-bit registers)
and last instruction jumps to next instruction in the BPF code.
extable 'insn' field contains relative offset of the instruction and
'fixup' field contains relative offset of the fixup entry. Example
layout of BPF program with extable present:

  +--+
  |  |
  |  |
0x4020 -->| lwz   r28,4(r4)  |
  |  |
  |  |
0x40ac -->| lwz  r3,0(r24)   |
  | lwz  r4,4(r24)   |
  |  |
  |  |
  |--|
0x4278 -->| li  r28,0|  \
  | li  r27,0|  | fixup entry
  | b   0x4024   |  /
0x4284 -->| li  r4,0 |
  | li  r3,0 |
  | b   0x40b4   |
  |--|
0x4290 -->| insn=0xfd90  |  \ extable entry
  | fixup=0xffe4 |  /
0x4298 -->| insn=0xfe14  |
  | fixup=0xffe8 |
  +--+

(Addresses shown here are chosen random, not real)

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to add BPF_PROBE_MEM support for PPC32.


  arch/powerpc/net/bpf_jit.h|  7 +
  arch/powerpc/net/bpf_jit_comp.c   | 50 +++
  arch/powerpc/net/bpf_jit_comp32.c | 30 +++
  arch/powerpc/net/bpf_jit_comp64.c | 48 ++---
  4 files changed, 89 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 6357c71c26eb..6a591ef88006 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -144,7 +144,11 @@ struct codegen_context {
unsigned int exentry_idx;
  };
  
+#ifdef CONFIG_PPC32

+#define BPF_FIXUP_LEN  12 /* Three instructions */


Could use 3 and 2 instead of 12 and 8, see later why.


+#else
  #define BPF_FIXUP_LEN 8 /* Two instructions */
+#endif
  
  static inline void bpf_flush_icache(void *start, void *end)

  {
@@ -174,6 +178,9 @@ void bpf_jit_build_prologue(u32 *image, struct 
codegen_context *ctx);
  void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
  void bpf_jit_realloc_regs(struct codegen_context *ctx);
  
+int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct codegen_context *ctx,

+ int insn_idx, int jmp_off, int dst_reg);
+
  #endif
  
  #endif

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index e92bd79d3bac..a1753b8c78c8 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -271,3 +271,53 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
  
  	return fp;

  }
+
+/*
+ * The caller should check for (BPF_MODE(code) == BPF_PROBE_MEM) before calling
+ * this function, as this only applies to BPF_PROBE_MEM, for now.
+ */
+int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct 
codegen_context *ctx,
+ int insn_idx, int jmp_off, int dst_reg)
+{


Modify patch 5 to get that function already in 
arch/powerpc/net/bpf_jit_comp.c, so that only changes/additions to the 
function appear here.



And you can have the prototype ready for the final version in patch 5 
instead of adding new arguments here and having to change ppc64 call site.


And in fact you can use them already in patch 5, like jmp_off.


+   off_t offset;
+   unsigned long pc;
+   struct exception_table_entry *ex;
+   u32 *fixup;
+
+   /* Populate extable entries only in the last pass */
+   if (pass != 2)
+   return 0;
+
+   if (!fp->aux->extable ||
+   WARN_ON_ONCE(ctx->exentry_idx >= fp->aux->num_exentries))
+   return -EINVAL;
+
+   pc = (unsigned long)&image[insn_idx];
+
+   fixup = (void *)fp->aux->extable -
+   (fp->aux->num_exentries * BPF_FIXUP_LEN) +
+   (ctx->exentry_idx * BPF_FIXUP_LEN);


Use 2 or 3 for BPF_FIXUP_LEN and multiply by 4 here.
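
Spelled out, the two remarks above amount to something like this (sketch only; names are from the patch):

    /* in bpf_jit.h: count instructions, not bytes */
    #ifdef CONFIG_PPC32
    #define BPF_FIXUP_LEN	3 /* Three instructions */
    #else
    #define BPF_FIXUP_LEN	2 /* Two instructions */
    #endif

    /* in bpf_add_extable_entry(): scale to bytes at the point of use */
    	fixup = (void *)fp->aux->extable -
    		(fp->aux->num_exentries * BPF_FIXUP_LEN * 4) +
    		(ctx->exentry_idx * BPF_FIXUP_LEN * 4);

The fixup_len computation in bpf_int_jit_compile() would then need the same '* 4'.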


+
+   fixup[0] = PPC_RAW_LI(dst_reg, 0);
+#ifdef CONFIG_PPC32


You should use 'if (IS_ENABLED(CONFIG_PPC32)' instead of a #ifdef here.
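
That is, something along these lines (sketch, using the registers and helpers already in the patch):

    	fixup[0] = PPC_RAW_LI(dst_reg, 0);
    	if (IS_ENABLED(CONFIG_PPC32)) {
    		/* clear the higher 32-bit register too */
    		fixup[1] = PPC_RAW_LI(dst_reg - 1, 0);
    		fixup[2] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[2]);
    	} else {
    		fixup[1] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[1]);
    	}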


+   fixup[1] = PPC_RAW_LI(dst_reg - 1, 0); /* clear higher 32-bit register too */
+   fixup[2] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[2]);
+#else
+   fixup[1] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[1]);

Re: [PATCH v2 3/8] bpf powerpc: refactor JIT compiler code

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

Refactor powerpc JITing. This simplifies adding BPF_PROBE_MEM support.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to refactor a bit of JITing code.


  arch/powerpc/net/bpf_jit_comp32.c | 50 +++-
  arch/powerpc/net/bpf_jit_comp64.c | 76 ---
  2 files changed, 68 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index b60b59426a24..c8ae14c316e3 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -276,17 +276,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
  
  	for (i = 0; i < flen; i++) {

-   u32 code = insn[i].code;
u32 dst_reg = bpf_to_ppc(ctx, insn[i].dst_reg);
-   u32 dst_reg_h = dst_reg - 1;
u32 src_reg = bpf_to_ppc(ctx, insn[i].src_reg);
-   u32 src_reg_h = src_reg - 1;
u32 tmp_reg = bpf_to_ppc(ctx, TMP_REG);
+   u32 true_cond, code = insn[i].code;
+   u32 dst_reg_h = dst_reg - 1;
+   u32 src_reg_h = src_reg - 1;
+   u32 size = BPF_SIZE(code);
s16 off = insn[i].off;
s32 imm = insn[i].imm;
bool func_addr_fixed;
u64 func_addr;
-   u32 true_cond;
  
  		/*

 * addrs[] maps a BPF bytecode address into a real offset from
@@ -809,25 +809,33 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/*
 * BPF_LDX
 */
-   case BPF_LDX | BPF_MEM | BPF_B: /* dst = *(u8 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_H: /* dst = *(u16 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_W: /* dst = *(u32 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
+   /* dst = *(u8 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_B:
+   /* dst = *(u16 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_H:
+   /* dst = *(u32 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_W:
+   /* dst = *(u64 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_DW:


You have to add the 'fallthrough;' keyword to avoid build failure with 
later versions of GCC.




+   switch (size) {
+   case BPF_B:
+   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
+   break;
+   case BPF_H:
+   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
+   break;
+   case BPF_W:
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
+   break;
+   case BPF_DW:
+   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
+   break;
+   }
+
+   if ((size != BPF_DW) && !fp->aux->verifier_zext)
EMIT(PPC_RAW_LI(dst_reg_h, 0));
break;
-   case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
-   break;
  
  		/*

 * Doubleword load
diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 2a87da50d9a4..78b28f2c 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -282,16 +282,15 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
  
  	for (i = 0; i < flen; i++) {

-   u32 code = insn[i].code;
u32 dst_reg = b2p[insn[i].dst_reg];
u32 src_reg = b2p[insn[i].src_reg];
+   u32 tmp_idx, code = insn[i].code;
+   u32 size = BPF_SIZE(code);
s16 off = insn[i].off;
s32 imm = insn[i].imm;
bool func_addr_fixed;
-   

Re: [PATCH v2 5/8] bpf ppc64: Add BPF_PROBE_MEM support for JIT

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

From: Ravi Bangoria 

BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.

Unlike other archs which use extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains two instructions,
first instruction clears dest_reg and 2nd jumps to next instruction
in the BPF code. extable 'insn' field contains relative offset of
the instruction and 'fixup' field contains relative offset of the
fixup entry. Example layout of BPF program with extable present:

  +--+
  |  |
  |  |
0x4020 -->| ld   r27,4(r3)   |
  |  |
  |  |
0x40ac -->| lwz  r3,0(r4)|
  |  |
  |  |
  |--|
0x4280 -->| li  r27,0|  \ fixup entry
  | b   0x4024   |  /
0x4288 -->| li  r3,0 |
  | b   0x40b0   |
  |--|
0x4290 -->| insn=0xfd90  |  \ extable entry
  | fixup=0xffec |  /
0x4298 -->| insn=0xfe14  |
  | fixup=0xffec |
  +--+

(Addresses shown here are chosen random, not real)

Signed-off-by: Ravi Bangoria 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Used JITing code after refactoring.
* Replaced 'xor reg,reg,reg' with 'li reg,0' where appropriate.
* Avoided unnecessary init during declaration.


  arch/powerpc/net/bpf_jit.h|  5 ++-
  arch/powerpc/net/bpf_jit_comp.c   | 25 ++
  arch/powerpc/net/bpf_jit_comp32.c |  2 +-
  arch/powerpc/net/bpf_jit_comp64.c | 57 ++-
  4 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 0c8f885b8f48..6357c71c26eb 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -141,8 +141,11 @@ struct codegen_context {
unsigned int idx;
unsigned int stack_size;
int b2p[ARRAY_SIZE(b2p)];
+   unsigned int exentry_idx;
  };
  
+#define BPF_FIXUP_LEN	8 /* Two instructions */

+
  static inline void bpf_flush_icache(void *start, void *end)
  {
smp_wmb();  /* smp write barrier */
@@ -166,7 +169,7 @@ static inline void bpf_clear_seen_register(struct 
codegen_context *ctx, int i)
  
  void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 func);

  int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct 
codegen_context *ctx,
-  u32 *addrs);
+  u32 *addrs, int pass);
  void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
  void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
  void bpf_jit_realloc_regs(struct codegen_context *ctx);
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index c5c9e8ad1de7..e92bd79d3bac 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -101,6 +101,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
struct bpf_prog *tmp_fp;
bool bpf_blinded = false;
bool extra_pass = false;
+   u32 extable_len;
+   u32 fixup_len;
  
  	if (!fp->jit_requested)

return org_fp;
@@ -131,7 +133,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
image = jit_data->image;
bpf_hdr = jit_data->header;
proglen = jit_data->proglen;
-   alloclen = proglen + FUNCTION_DESCR_SIZE;
extra_pass = true;
goto skip_init_ctx;
}
@@ -149,7 +150,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
  
  	/* Scouting faux-generate pass 0 */

-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
+   if (bpf_jit_build_body(fp, 0, &cgctx, addrs, 0)) {
/* We hit something illegal or unsupported. */
fp = org_fp;
goto out_addrs;
@@ -162,7 +163,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 */
if (cgctx.seen & SEEN_TAILCALL) {
cgctx.idx = 0;
-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
+   if (bpf_jit_build_body(fp, 0, &cgctx, addrs, 0)) {
fp = org_fp;
goto out_addrs;
}
@@ -177,8 +178,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
bpf_jit_build_prologue(0, );
bpf_jit_build_epilogue(0, );
  
+	fixup_len = fp->aux->num_exentries * BPF_FIXUP_LEN;

+   extable_len = fp->aux->num_exentries * sizeof(struct exception_table_entry);

Re: [PATCH v2 4/8] powerpc/ppc-opcode: introduce PPC_RAW_BRANCH() macro

2021-09-17 Thread LEROY Christophe


Le 17/09/2021 à 17:30, Hari Bathini a écrit :
> Define and use PPC_RAW_BRANCH() macro instead of open coding it. This
> macro is used while adding BPF_PROBE_MEM support.
> 
> Signed-off-by: Hari Bathini 

Reviewed-by: Christophe Leroy 

> ---
> 
> Changes in v2:
> * New patch to introduce PPC_RAW_BRANCH() macro.
> 
> 
>   arch/powerpc/include/asm/ppc-opcode.h | 2 ++
>   arch/powerpc/net/bpf_jit.h| 4 ++--
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index baea657bc868..f50213e2a3e0 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -566,6 +566,8 @@
>   #define PPC_RAW_MTSPR(spr, d)   (0x7c0003a6 | ___PPC_RS(d) | __PPC_SPR(spr))
>   #define PPC_RAW_EIEIO() (0x7c0006ac)
>   
> +#define PPC_RAW_BRANCH(addr) (PPC_INST_BRANCH | ((addr) & 0x03fffffc))
> +
>   /* Deal with instructions that older assemblers aren't aware of */
>   #define PPC_BCCTR_FLUSH stringify_in_c(.long 
> PPC_INST_BCCTR_FLUSH)
>   #define PPC_CP_ABORTstringify_in_c(.long PPC_RAW_CP_ABORT)
> diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
> index 411c63d945c7..0c8f885b8f48 100644
> --- a/arch/powerpc/net/bpf_jit.h
> +++ b/arch/powerpc/net/bpf_jit.h
> @@ -24,8 +24,8 @@
>   #define EMIT(instr) PLANT_INSTR(image, ctx->idx, instr)
>   
>   /* Long jump; (unconditional 'branch') */
> -#define PPC_JMP(dest)	EMIT(PPC_INST_BRANCH | \
> -				  (((dest) - (ctx->idx * 4)) & 0x03fffffc))
> +#define PPC_JMP(dest)	EMIT(PPC_RAW_BRANCH((dest) - (ctx->idx * 4)))
> +
>   /* blr; (unconditional 'branch' with link) to absolute address */
>   #define PPC_BL_ABS(dest)	EMIT(PPC_INST_BL | \
>				  (((dest) - (unsigned long)(image + ctx->idx)) & 0x03fffffc))
> 

Re: [PATCH v2 3/8] bpf powerpc: refactor JIT compiler code

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

Refactor powerpc JITing. This simplifies adding BPF_PROBE_MEM support.


Could you describe a bit more what you are refactoring exactly?




Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to refactor a bit of JITing code.


  arch/powerpc/net/bpf_jit_comp32.c | 50 +++-
  arch/powerpc/net/bpf_jit_comp64.c | 76 ---
  2 files changed, 68 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index b60b59426a24..c8ae14c316e3 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -276,17 +276,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
  
  	for (i = 0; i < flen; i++) {

-   u32 code = insn[i].code;
u32 dst_reg = bpf_to_ppc(ctx, insn[i].dst_reg);
-   u32 dst_reg_h = dst_reg - 1;
u32 src_reg = bpf_to_ppc(ctx, insn[i].src_reg);
-   u32 src_reg_h = src_reg - 1;
u32 tmp_reg = bpf_to_ppc(ctx, TMP_REG);
+   u32 true_cond, code = insn[i].code;
+   u32 dst_reg_h = dst_reg - 1;
+   u32 src_reg_h = src_reg - 1;


All changes above seem unneeded and not linked to the current patch. 
Please leave cosmetic changes outside and focus on necessary changes.



+   u32 size = BPF_SIZE(code);
s16 off = insn[i].off;
s32 imm = insn[i].imm;
bool func_addr_fixed;
u64 func_addr;
-   u32 true_cond;
  
  		/*

 * addrs[] maps a BPF bytecode address into a real offset from
@@ -809,25 +809,33 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/*
 * BPF_LDX
 */
-   case BPF_LDX | BPF_MEM | BPF_B: /* dst = *(u8 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_H: /* dst = *(u16 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_W: /* dst = *(u32 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
+   /* dst = *(u8 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_B:
+   /* dst = *(u16 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_H:
+   /* dst = *(u32 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_W:
+   /* dst = *(u64 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_DW:

Why change the location of the comments? I found it more readable before.


+   switch (size) {
+   case BPF_B:
+   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
+   break;
+   case BPF_H:
+   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
+   break;
+   case BPF_W:
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
+   break;
+   case BPF_DW:
+   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
+   break;
+   }


BPF_B, BPF_H, ... are not part of an enum. Are you sure GCC is happy to 
have no default?
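
If it does turn out to warn, a defensive default is cheap enough (sketch):

		default:
			/* unreachable: size is BPF_B/H/W/DW for these opcodes */
			break;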



+
+   if ((size != BPF_DW) && !fp->aux->verifier_zext)


You don't need () around size != BPF_DW


EMIT(PPC_RAW_LI(dst_reg_h, 0));
break;
-   case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
-   break;
  
  		/*

 * Doubleword load
diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 2a87da50d9a4..78b28f2c 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -282,16 +282,15 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
  
  	for (i = 0; i < flen; i++) {

-   u32 code = insn[i].code;

Re: [PATCH v2 1/8] bpf powerpc: Remove unused SEEN_STACK

2021-09-17 Thread Christophe Leroy




Le 17/09/2021 à 17:30, Hari Bathini a écrit :

From: Ravi Bangoria 

SEEN_STACK is unused on PowerPC. Remove it. Also, have
SEEN_TAILCALL use 0x4000.

Signed-off-by: Ravi Bangoria 


Reviewed-by: Christophe Leroy 


---

* No changes in v2.


  arch/powerpc/net/bpf_jit.h | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 99fad093f43e..d6267e93027a 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -116,8 +116,7 @@ static inline bool is_nearbranch(int offset)
  #define COND_LE   (CR0_GT | COND_CMP_FALSE)
  
  #define SEEN_FUNC	0x2000 /* might call external helpers */

-#define SEEN_STACK 0x4000 /* uses BPF stack */
-#define SEEN_TAILCALL  0x8000 /* uses tail calls */
+#define SEEN_TAILCALL  0x4000 /* uses tail calls */
  
  #define SEEN_VREG_MASK	0x1ff8 /* Volatile registers r3-r12 */

  #define SEEN_NVREG_MASK   0x0003 /* Non volatile registers r14-r31 */



[PATCH v2 8/8] bpf ppc32: Add addr > TASK_SIZE_MAX explicit check

2021-09-17 Thread Hari Bathini
With KUAP enabled, any kernel code which wants to access userspace
needs to be surrounded by disable-enable KUAP. But that is not
happening for BPF_PROBE_MEM load instruction. Though PPC32 does not
support read protection, considering the fact that PTR_TO_BTF_ID
(which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer
or NULL but should never be a pointer to userspace address, execute
BPF_PROBE_MEM load only if addr > TASK_SIZE_MAX, otherwise set
dst_reg=0 and move on.

This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.

[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to handle bad userspace pointers on PPC32. 


 arch/powerpc/net/bpf_jit_comp32.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index c6262289dcc4..faa8047fcf4a 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -821,6 +821,45 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/* dst = *(u64 *)(ul) (src + off) */
case BPF_LDX | BPF_MEM | BPF_DW:
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+   /*
+* As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could 
either be a valid
+* kernel pointer or NULL but not a userspace address, 
execute BPF_PROBE_MEM
+* load only if addr > TASK_SIZE_MAX, otherwise set 
dst_reg=0 and move on.
+*/
+   if (BPF_MODE(code) == BPF_PROBE_MEM) {
+   bool extra_insn_needed = false;
+   unsigned int adjusted_idx;
+
+   /*
+* For BPF_DW case, "li reg_h,0" would be 
needed when
+* !fp->aux->verifier_zext. Adjust conditional 
branch'ing
+* address accordingly.
+*/
+   if ((size == BPF_DW) && !fp->aux->verifier_zext)
+   extra_insn_needed = true;
+
+   /*
+* Need to jump two instructions instead of one 
for BPF_DW case
+* as there are two load instructions for 
dst_reg_h & dst_reg
+* respectively.
+*/
+   adjusted_idx = (size == BPF_DW) ? 1 : 0;
+
+   EMIT(PPC_RAW_ADDI(b2p[TMP_REG], src_reg, off));
+   PPC_LI32(_R0, TASK_SIZE_MAX);
+   EMIT(PPC_RAW_CMPLW(b2p[TMP_REG], _R0));
+   PPC_BCC(COND_GT, (ctx->idx + 4 + (extra_insn_needed ? 1 : 0)) * 4);
+   EMIT(PPC_RAW_LI(dst_reg, 0));
+   /*
+* Note that "li reg_h,0" is emitted for BPF_B/H/W case,
+* if necessary. So, jump there instead of emitting an
+* additional "li reg_h,0" instruction.
+*/
+   if (extra_insn_needed)
+   EMIT(PPC_RAW_LI(dst_reg_h, 0));
+   PPC_JMP((ctx->idx + 2 + adjusted_idx) * 4);
+   }
+
switch (size) {
case BPF_B:
EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
-- 
2.31.1



[PATCH v2 7/8] bpf ppc32: Add BPF_PROBE_MEM support for JIT

2021-09-17 Thread Hari Bathini
BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.

Unlike other archs which use extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains 3 instructions,
first 2 instructions clear dest_reg (lower & higher 32-bit registers)
and last instruction jumps to next instruction in the BPF code.
extable 'insn' field contains relative offset of the instruction and
'fixup' field contains relative offset of the fixup entry. Example
layout of BPF program with extable present:

 +--+
 |  |
 |  |
   0x4020 -->| lwz   r28,4(r4)  |
 |  |
 |  |
   0x40ac -->| lwz  r3,0(r24)   |
 | lwz  r4,4(r24)   |
 |  |
 |  |
 |--|
   0x4278 -->| li  r28,0|  \
 | li  r27,0|  | fixup entry
 | b   0x4024   |  /
   0x4284 -->| li  r4,0 |
 | li  r3,0 |
 | b   0x40b4   |
 |--|
   0x4290 -->| insn=0xfd90  |  \ extable entry
 | fixup=0xffe4 |  /
   0x4298 -->| insn=0xfe14  |
 | fixup=0xffe8 |
 +--+

   (Addresses shown here are chosen random, not real)

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to add BPF_PROBE_MEM support for PPC32.


 arch/powerpc/net/bpf_jit.h|  7 +
 arch/powerpc/net/bpf_jit_comp.c   | 50 +++
 arch/powerpc/net/bpf_jit_comp32.c | 30 +++
 arch/powerpc/net/bpf_jit_comp64.c | 48 ++---
 4 files changed, 89 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 6357c71c26eb..6a591ef88006 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -144,7 +144,11 @@ struct codegen_context {
unsigned int exentry_idx;
 };
 
+#ifdef CONFIG_PPC32
+#define BPF_FIXUP_LEN  12 /* Three instructions */
+#else
 #define BPF_FIXUP_LEN  8 /* Two instructions */
+#endif
 
 static inline void bpf_flush_icache(void *start, void *end)
 {
@@ -174,6 +178,9 @@ void bpf_jit_build_prologue(u32 *image, struct 
codegen_context *ctx);
 void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
 void bpf_jit_realloc_regs(struct codegen_context *ctx);
 
+int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct 
codegen_context *ctx,
+ int insn_idx, int jmp_off, int dst_reg);
+
 #endif
 
 #endif
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index e92bd79d3bac..a1753b8c78c8 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -271,3 +271,53 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 
return fp;
 }
+
+/*
+ * The caller should check for (BPF_MODE(code) == BPF_PROBE_MEM) before calling
+ * this function, as this only applies to BPF_PROBE_MEM, for now.
+ */
+int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct 
codegen_context *ctx,
+ int insn_idx, int jmp_off, int dst_reg)
+{
+   off_t offset;
+   unsigned long pc;
+   struct exception_table_entry *ex;
+   u32 *fixup;
+
+   /* Populate extable entries only in the last pass */
+   if (pass != 2)
+   return 0;
+
+   if (!fp->aux->extable ||
+   WARN_ON_ONCE(ctx->exentry_idx >= fp->aux->num_exentries))
+   return -EINVAL;
+
+   pc = (unsigned long)&image[insn_idx];
+
+   fixup = (void *)fp->aux->extable -
+   (fp->aux->num_exentries * BPF_FIXUP_LEN) +
+   (ctx->exentry_idx * BPF_FIXUP_LEN);
+
+   fixup[0] = PPC_RAW_LI(dst_reg, 0);
+#ifdef CONFIG_PPC32
+   fixup[1] = PPC_RAW_LI(dst_reg - 1, 0); /* clear higher 32-bit register too */
+   fixup[2] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[2]);
+#else
+   fixup[1] = PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[1]);
+#endif
+
+   ex = &fp->aux->extable[ctx->exentry_idx];
+
+   offset = pc - (long)&ex->insn;
+   if (WARN_ON_ONCE(offset >= 0 || offset < INT_MIN))
+   return -ERANGE;
+   ex->insn = offset;
+
+   offset = (long)fixup - (long)&ex->fixup;
+   if (WARN_ON_ONCE(offset >= 0 || offset < INT_MIN))
+   return -ERANGE;
+   ex->fixup = offset;
+
+   ctx->exentry_idx++;
+   return 0;
+}
diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index 94641b7be387..c6262289dcc4 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c

[PATCH v2 3/8] bpf powerpc: refactor JIT compiler code

2021-09-17 Thread Hari Bathini
Refactor powerpc JITing. This simplifies adding BPF_PROBE_MEM support.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to refactor a bit of JITing code.


 arch/powerpc/net/bpf_jit_comp32.c | 50 +++-
 arch/powerpc/net/bpf_jit_comp64.c | 76 ---
 2 files changed, 68 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index b60b59426a24..c8ae14c316e3 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -276,17 +276,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
 
for (i = 0; i < flen; i++) {
-   u32 code = insn[i].code;
u32 dst_reg = bpf_to_ppc(ctx, insn[i].dst_reg);
-   u32 dst_reg_h = dst_reg - 1;
u32 src_reg = bpf_to_ppc(ctx, insn[i].src_reg);
-   u32 src_reg_h = src_reg - 1;
u32 tmp_reg = bpf_to_ppc(ctx, TMP_REG);
+   u32 true_cond, code = insn[i].code;
+   u32 dst_reg_h = dst_reg - 1;
+   u32 src_reg_h = src_reg - 1;
+   u32 size = BPF_SIZE(code);
s16 off = insn[i].off;
s32 imm = insn[i].imm;
bool func_addr_fixed;
u64 func_addr;
-   u32 true_cond;
 
/*
 * addrs[] maps a BPF bytecode address into a real offset from
@@ -809,25 +809,33 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/*
 * BPF_LDX
 */
-   case BPF_LDX | BPF_MEM | BPF_B: /* dst = *(u8 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_H: /* dst = *(u16 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   break;
-   case BPF_LDX | BPF_MEM | BPF_W: /* dst = *(u32 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
-   if (!fp->aux->verifier_zext)
+   /* dst = *(u8 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_B:
+   /* dst = *(u16 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_H:
+   /* dst = *(u32 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_W:
+   /* dst = *(u64 *)(ul) (src + off) */
+   case BPF_LDX | BPF_MEM | BPF_DW:
+   switch (size) {
+   case BPF_B:
+   EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
+   break;
+   case BPF_H:
+   EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
+   break;
+   case BPF_W:
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
+   break;
+   case BPF_DW:
+   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
+   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
+   break;
+   }
+
+   if ((size != BPF_DW) && !fp->aux->verifier_zext)
EMIT(PPC_RAW_LI(dst_reg_h, 0));
break;
-   case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + 
off) */
-   EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
-   EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
-   break;
 
/*
 * Doubleword load
diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 2a87da50d9a4..78b28f2c 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -282,16 +282,15 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
u32 exit_addr = addrs[flen];
 
for (i = 0; i < flen; i++) {
-   u32 code = insn[i].code;
u32 dst_reg = b2p[insn[i].dst_reg];
u32 src_reg = b2p[insn[i].src_reg];
+   u32 tmp_idx, code = insn[i].code;
+   u32 size = BPF_SIZE(code);
s16 off = insn[i].off;
s32 imm = insn[i].imm;
bool func_addr_fixed;
-   u64 func_addr;
-   u64 imm64;
+   u64 func_addr, imm64;
u32 true_cond;
- 

[PATCH v2 6/8] bpf ppc64: Add addr > TASK_SIZE_MAX explicit check

2021-09-17 Thread Hari Bathini
From: Ravi Bangoria 

On PPC64 with KUAP enabled, any kernel code which wants to
access userspace needs to be surrounded by disable-enable KUAP.
But that is not happening for BPF_PROBE_MEM load instruction.
So, when BPF program tries to access invalid userspace address,
page-fault handler considers it as bad KUAP fault:

  Kernel attempted to read user page (d000) - exploit attempt? (uid: 0)

Considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM
mode) could either be a valid kernel pointer or NULL but should
never be a pointer to userspace address, execute BPF_PROBE_MEM load
only if addr > TASK_SIZE_MAX, otherwise set dst_reg=0 and move on.

This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.

[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov 
Signed-off-by: Ravi Bangoria 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Refactored the code based on Christophe's comments.


 arch/powerpc/net/bpf_jit_comp64.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 2fc10995f243..eb28dbc67151 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -769,6 +769,29 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
/* dst = *(u64 *)(ul) (src + off) */
case BPF_LDX | BPF_MEM | BPF_DW:
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+   /*
+* As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could 
either be a valid
+* kernel pointer or NULL but not a userspace address, 
execute BPF_PROBE_MEM
+* load only if addr > TASK_SIZE_MAX, otherwise set 
dst_reg=0 and move on.
+*/
+   if (BPF_MODE(code) == BPF_PROBE_MEM) {
+   unsigned int adjusted_idx;
+
+   /*
+* Check if 'off' is word aligned because PPC_BPF_LL()
+* (BPF_DW case) generates two instructions if 'off' is not
+* word-aligned and one instruction otherwise.
+*/
+   adjusted_idx = ((BPF_SIZE(code) == BPF_DW) && (off & 3)) ? 1 : 0;
+
+   EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], src_reg, off));
+   PPC_LI64(b2p[TMP_REG_2], TASK_SIZE_MAX);
+   EMIT(PPC_RAW_CMPLD(b2p[TMP_REG_1], b2p[TMP_REG_2]));
+   PPC_BCC(COND_GT, (ctx->idx + 4) * 4);
+   EMIT(PPC_RAW_LI(dst_reg, 0));
+   PPC_JMP((ctx->idx + 2 + adjusted_idx) * 4);
+   }
+
switch (size) {
case BPF_B:
EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
-- 
2.31.1



[PATCH v2 5/8] bpf ppc64: Add BPF_PROBE_MEM support for JIT

2021-09-17 Thread Hari Bathini
From: Ravi Bangoria 

BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.

Unlike other archs which use extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains two instructions,
first instruction clears dest_reg and 2nd jumps to next instruction
in the BPF code. extable 'insn' field contains relative offset of
the instruction and 'fixup' field contains relative offset of the
fixup entry. Example layout of BPF program with extable present:

 +--+
 |  |
 |  |
   0x4020 -->| ld   r27,4(r3)   |
 |  |
 |  |
   0x40ac -->| lwz  r3,0(r4)|
 |  |
 |  |
 |--|
   0x4280 -->| li  r27,0|  \ fixup entry
 | b   0x4024   |  /
   0x4288 -->| li  r3,0 |
 | b   0x40b0   |
 |--|
   0x4290 -->| insn=0xfd90  |  \ extable entry
 | fixup=0xffec |  /
   0x4298 -->| insn=0xfe14  |
 | fixup=0xffec |
 +--+

   (Addresses shown here are chosen random, not real)

Signed-off-by: Ravi Bangoria 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Used JITing code after refactoring.
* Replaced 'xor reg,reg,reg' with 'li reg,0' where appropriate.
* Avoided unnecessary init during declaration.


 arch/powerpc/net/bpf_jit.h|  5 ++-
 arch/powerpc/net/bpf_jit_comp.c   | 25 ++
 arch/powerpc/net/bpf_jit_comp32.c |  2 +-
 arch/powerpc/net/bpf_jit_comp64.c | 57 ++-
 4 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 0c8f885b8f48..6357c71c26eb 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -141,8 +141,11 @@ struct codegen_context {
unsigned int idx;
unsigned int stack_size;
int b2p[ARRAY_SIZE(b2p)];
+   unsigned int exentry_idx;
 };
 
+#define BPF_FIXUP_LEN  8 /* Two instructions */
+
 static inline void bpf_flush_icache(void *start, void *end)
 {
smp_wmb();  /* smp write barrier */
@@ -166,7 +169,7 @@ static inline void bpf_clear_seen_register(struct 
codegen_context *ctx, int i)
 
 void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 
func);
 int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context 
*ctx,
-  u32 *addrs);
+  u32 *addrs, int pass);
 void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
 void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
 void bpf_jit_realloc_regs(struct codegen_context *ctx);
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index c5c9e8ad1de7..e92bd79d3bac 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -101,6 +101,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
struct bpf_prog *tmp_fp;
bool bpf_blinded = false;
bool extra_pass = false;
+   u32 extable_len;
+   u32 fixup_len;
 
if (!fp->jit_requested)
return org_fp;
@@ -131,7 +133,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
image = jit_data->image;
bpf_hdr = jit_data->header;
proglen = jit_data->proglen;
-   alloclen = proglen + FUNCTION_DESCR_SIZE;
extra_pass = true;
goto skip_init_ctx;
}
@@ -149,7 +150,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
 
/* Scouting faux-generate pass 0 */
-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
+   if (bpf_jit_build_body(fp, 0, &cgctx, addrs, 0)) {
/* We hit something illegal or unsupported. */
fp = org_fp;
goto out_addrs;
@@ -162,7 +163,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 */
if (cgctx.seen & SEEN_TAILCALL) {
cgctx.idx = 0;
-   if (bpf_jit_build_body(fp, 0, , addrs)) {
+   if (bpf_jit_build_body(fp, 0, , addrs, 0)) {
fp = org_fp;
goto out_addrs;
}
@@ -177,8 +178,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
bpf_jit_build_prologue(0, );
bpf_jit_build_epilogue(0, );
 
+   fixup_len = fp->aux->num_exentries * BPF_FIXUP_LEN;
+   extable_len = fp->aux->num_exentries * sizeof(struct 
exception_table_entry);
+
proglen = 

[PATCH v2 4/8] powerpc/ppc-opcode: introduce PPC_RAW_BRANCH() macro

2021-09-17 Thread Hari Bathini
Define and use PPC_RAW_BRANCH() macro instead of open coding it. This
macro is used while adding BPF_PROBE_MEM support.
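
For context, a simplified sketch of how the series pairs this macro with a
conditional skip when guarding a load (taken from the later BPF_PROBE_MEM
patch, with the alignment adjustment dropped for brevity):

	PPC_BCC(COND_GT, (ctx->idx + 4) * 4);	/* addr > TASK_SIZE_MAX: branch to the real load */
	EMIT(PPC_RAW_LI(dst_reg, 0));		/* otherwise zero dst_reg ... */
	PPC_JMP((ctx->idx + 2) * 4);		/* ... and jump past the load; PPC_JMP expands to PPC_RAW_BRANCH() */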

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to introduce PPC_RAW_BRANCH() macro.


 arch/powerpc/include/asm/ppc-opcode.h | 2 ++
 arch/powerpc/net/bpf_jit.h| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index baea657bc868..f50213e2a3e0 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -566,6 +566,8 @@
 #define PPC_RAW_MTSPR(spr, d)  (0x7c0003a6 | ___PPC_RS(d) | 
__PPC_SPR(spr))
 #define PPC_RAW_EIEIO()(0x7c0006ac)
 
+#define PPC_RAW_BRANCH(addr)   (PPC_INST_BRANCH | ((addr) & 0x03fffffc))
+
 /* Deal with instructions that older assemblers aren't aware of */
 #definePPC_BCCTR_FLUSH stringify_in_c(.long 
PPC_INST_BCCTR_FLUSH)
 #definePPC_CP_ABORTstringify_in_c(.long PPC_RAW_CP_ABORT)
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 411c63d945c7..0c8f885b8f48 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -24,8 +24,8 @@
 #define EMIT(instr)PLANT_INSTR(image, ctx->idx, instr)
 
 /* Long jump; (unconditional 'branch') */
-#define PPC_JMP(dest)  EMIT(PPC_INST_BRANCH |\
-(((dest) - (ctx->idx * 4)) & 0x03fffffc))
+#define PPC_JMP(dest)  EMIT(PPC_RAW_BRANCH((dest) - (ctx->idx * 4)))
+
 /* blr; (unconditional 'branch' with link) to absolute address */
 #define PPC_BL_ABS(dest)   EMIT(PPC_INST_BL |\
 (((dest) - (unsigned long)(image + ctx->idx)) & 0x03fffffc))
-- 
2.31.1



[PATCH v2 2/8] bpf powerpc: Remove extra_pass from bpf_jit_build_body()

2021-09-17 Thread Hari Bathini
From: Ravi Bangoria 

In case of extra_pass, usual JIT passes are always skipped. So,
extra_pass is always false while calling bpf_jit_build_body() and
thus it can be removed.

Signed-off-by: Ravi Bangoria 
---

Changes in v2:
* Updated the changelog wording a bit.


 arch/powerpc/net/bpf_jit.h| 2 +-
 arch/powerpc/net/bpf_jit_comp.c   | 6 +++---
 arch/powerpc/net/bpf_jit_comp32.c | 4 ++--
 arch/powerpc/net/bpf_jit_comp64.c | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index d6267e93027a..411c63d945c7 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -166,7 +166,7 @@ static inline void bpf_clear_seen_register(struct 
codegen_context *ctx, int i)
 
 void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 
func);
 int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context 
*ctx,
-  u32 *addrs, bool extra_pass);
+  u32 *addrs);
 void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
 void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
 void bpf_jit_realloc_regs(struct codegen_context *ctx);
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 53aefee3fe70..c5c9e8ad1de7 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -149,7 +149,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
 
/* Scouting faux-generate pass 0 */
-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) {
+   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
/* We hit something illegal or unsupported. */
fp = org_fp;
goto out_addrs;
@@ -162,7 +162,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 */
if (cgctx.seen & SEEN_TAILCALL) {
cgctx.idx = 0;
-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) {
+   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
fp = org_fp;
goto out_addrs;
}
@@ -210,7 +210,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
/* Now build the prologue, body code & epilogue for real. */
cgctx.idx = 0;
bpf_jit_build_prologue(code_base, &cgctx);
-   bpf_jit_build_body(fp, code_base, &cgctx, addrs, extra_pass);
+   bpf_jit_build_body(fp, code_base, &cgctx, addrs);
bpf_jit_build_epilogue(code_base, &cgctx);
 
if (bpf_jit_enable > 1)
diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index beb12cbc8c29..b60b59426a24 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -266,7 +266,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct 
codegen_context *ctx, u32
 
 /* Assemble the body code between the prologue & epilogue */
 int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context 
*ctx,
-  u32 *addrs, bool extra_pass)
+  u32 *addrs)
 {
const struct bpf_insn *insn = fp->insnsi;
int flen = fp->len;
@@ -860,7 +860,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
case BPF_JMP | BPF_CALL:
ctx->seen |= SEEN_FUNC;
 
-   ret = bpf_jit_get_func_addr(fp, &insn[i], extra_pass,
+   ret = bpf_jit_get_func_addr(fp, &insn[i], false,
&func_addr, &func_addr_fixed);
if (ret < 0)
return ret;
diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index b87a63dba9c8..2a87da50d9a4 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -272,7 +272,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct 
codegen_context *ctx, u32
 
 /* Assemble the body code between the prologue & epilogue */
 int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context 
*ctx,
-  u32 *addrs, bool extra_pass)
+  u32 *addrs)
 {
const struct bpf_insn *insn = fp->insnsi;
int flen = fp->len;
@@ -769,7 +769,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
case BPF_JMP | BPF_CALL:
ctx->seen |= SEEN_FUNC;
 
-   ret = bpf_jit_get_func_addr(fp, &insn[i], extra_pass,
+   ret = bpf_jit_get_func_addr(fp, &insn[i], false,
&func_addr, &func_addr_fixed);
if (ret < 0)
return ret;
-- 
2.31.1



[PATCH v2 1/8] bpf powerpc: Remove unused SEEN_STACK

2021-09-17 Thread Hari Bathini
From: Ravi Bangoria 

SEEN_STACK is unused on PowerPC. Remove it. Also, have
SEEN_TAILCALL use 0x40000000.

Signed-off-by: Ravi Bangoria 
---

* No changes in v2.


 arch/powerpc/net/bpf_jit.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 99fad093f43e..d6267e93027a 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -116,8 +116,7 @@ static inline bool is_nearbranch(int offset)
 #define COND_LE(CR0_GT | COND_CMP_FALSE)
 
 #define SEEN_FUNC  0x20000000 /* might call external helpers */
-#define SEEN_STACK 0x40000000 /* uses BPF stack */
-#define SEEN_TAILCALL  0x80000000 /* uses tail calls */
+#define SEEN_TAILCALL  0x40000000 /* uses tail calls */
 
 #define SEEN_VREG_MASK 0x1ff80000 /* Volatile registers r3-r12 */
 #define SEEN_NVREG_MASK        0x0003ffff /* Non volatile registers r14-r31 */
-- 
2.31.1



[PATCH v2 0/8] bpf powerpc: Add BPF_PROBE_MEM support in powerpc JIT compiler

2021-09-17 Thread Hari Bathini
Patch #1 & #2 are simple cleanup patches. Patch #3 refactors JIT
compiler code with the aim to simplify adding BPF_PROBE_MEM support.
Patch #4 introduces PPC_RAW_BRANCH() macro instead of open coding
branch instruction. Patch #5 & #7 add BPF_PROBE_MEM support for PPC64
& PPC32 JIT compilers respectively. Patch #6 & #8 add explicit
addr > TASK_SIZE_MAX check to handle bad userspace pointers for
PPC64 & PPC32 cases respectively.

Link to v1 posted by Ravi:

  
https://lore.kernel.org/all/20210706073211.349889-1-ravi.bango...@linux.ibm.com/
  ("[PATCH 0/4] bpf powerpc: Add BPF_PROBE_MEM support for 64bit JIT")


Hari Bathini (4):
  bpf powerpc: refactor JIT compiler code
  powerpc/ppc-opcode: introduce PPC_RAW_BRANCH() macro
  bpf ppc32: Add BPF_PROBE_MEM support for JIT
  bpf ppc32: Add addr > TASK_SIZE_MAX explicit check

Ravi Bangoria (4):
  bpf powerpc: Remove unused SEEN_STACK
  bpf powerpc: Remove extra_pass from bpf_jit_build_body()
  bpf ppc64: Add BPF_PROBE_MEM support for JIT
  bpf ppc64: Add addr > TASK_SIZE_MAX explicit check

 arch/powerpc/include/asm/ppc-opcode.h |   2 +
 arch/powerpc/net/bpf_jit.h|  19 ++--
 arch/powerpc/net/bpf_jit_comp.c   |  75 ++--
 arch/powerpc/net/bpf_jit_comp32.c | 123 +-
 arch/powerpc/net/bpf_jit_comp64.c | 114 
 5 files changed, 260 insertions(+), 73 deletions(-)

-- 
2.31.1



Re: [PATCH v5 5/6] sched/fair: Carve out logic to mark a group for asymmetric packing

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> Create a separate function, sched_asym(). A subsequent changeset will
> introduce logic to deal with SMT in conjunction with asymmetric
> packing. Such logic will need the statistics of the scheduling
> group provided as argument. Update them before calling sched_asym().
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Co-developed-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Remove a redundant check for the local group in sched_asym().
> (Dietmar)
>   * Reworded commit message for clarity. (Len)
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/fair.c | 20 +---
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c5851260b4d8..26db017c14a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8597,6 +8597,13 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +static inline bool
> +sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats 
> *sgs,
> +  struct sched_group *group)
> +{
> +   return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> +}
> +
>  /**
>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -8657,18 +8664,17 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
> }
> }
>
> +   sgs->group_capacity = group->sgc->capacity;
> +
> +   sgs->group_weight = group->group_weight;
> +
> /* Check if dst CPU is idle and preferred to this group */
> if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> -   env->idle != CPU_NOT_IDLE &&
> -   sgs->sum_h_nr_running &&
> -   sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> +   env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
> +   sched_asym(env, sds, sgs, group)) {
> sgs->group_asym_packing = 1;
> }
>
> -   sgs->group_capacity = group->sgc->capacity;
> -
> -   sgs->group_weight = group->group_weight;
> -
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> /* Computing avg_load makes sense only when group is overloaded */
> --
> 2.17.1
>


Re: [PATCH v5 4/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> Before deciding to pull tasks when using asymmetric packing of tasks,
> on some architectures (e.g., x86) it is necessary to know not only the
> state of dst_cpu but also of its SMT siblings. The decision to classify
> a candidate busiest group as group_asym_packing is done in
> update_sg_lb_stats(). Give this function access to the scheduling domain
> statistics, which contains the statistics of the local group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Originally-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * None
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a054f528bcc..c5851260b4d8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8605,6 +8605,7 @@ group_type group_classify(unsigned int imbalance_pct,
>   * @sg_status: Holds flag indicating the status of the sched_group
>   */
>  static inline void update_sg_lb_stats(struct lb_env *env,
> + struct sd_lb_stats *sds,
>   struct sched_group *group,
>   struct sg_lb_stats *sgs,
>   int *sg_status)
> @@ -8613,7 +8614,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>
> memset(sgs, 0, sizeof(*sgs));
>
> -   local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
> +   local_group = group == sds->local;
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> @@ -9176,7 +9177,7 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
> update_group_capacity(env->sd, env->dst_cpu);
> }
>
> -   update_sg_lb_stats(env, sg, sgs, &sg_status);
> +   update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
>
> if (local_group)
> goto next_group;
> --
> 2.17.1
>


Re: [PATCH v5 3/6] sched/fair: Optimize checking for group_asym_packing

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> sched_asym_prefer() always returns false when called on the local group. By
> checking local_group, we can avoid additional checks and invoking
> sched_asym_prefer() when it is not needed. No functional changes are
> introduced.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Further rewording of the commit message. (Len)
>
> Changes since v2:
>   * Reworded the commit message for clarity. (Peter Z)
>
> Changes since v1:
>   * None
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..7a054f528bcc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8657,7 +8657,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
> }
>
> /* Check if dst CPU is idle and preferred to this group */
> -   if (env->sd->flags & SD_ASYM_PACKING &&
> +   if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> env->idle != CPU_NOT_IDLE &&
> sgs->sum_h_nr_running &&
> sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> --
> 2.17.1
>


Re: [PATCH v5 2/6] sched/topology: Introduce sched_group::flags

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> There exist situations in which the load balance needs to know the
> properties of the CPUs in a scheduling group. When using asymmetric
> packing, for instance, the load balancer needs to know not only the
> state of dst_cpu but also of its SMT siblings, if any.
>
> Use the flags of the child scheduling domains to initialize scheduling
> group flags. This will reflect the properties of the CPUs in the
> group.
>
> A subsequent changeset will make use of these new flags. No functional
> changes are introduced.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Originally-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Clear the flags of the scheduling groups of a domain if its child is
> destroyed.
>   * Minor rewording of the commit message.
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/sched.h|  1 +
>  kernel/sched/topology.c | 21 ++---
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d3e5793e117..86ab33ce529d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1809,6 +1809,7 @@ struct sched_group {
> unsigned intgroup_weight;
> struct sched_group_capacity *sgc;
> int asym_prefer_cpu;/* CPU of highest 
> priority in group */
> +   int flags;
>
> /*
>  * The CPUs this group covers.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 4e8698e62f07..c56faae461d9 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -716,8 +716,20 @@ cpu_attach_domain(struct sched_domain *sd, struct 
> root_domain *rd, int cpu)
> tmp = sd;
> sd = sd->parent;
> destroy_sched_domain(tmp);
> -   if (sd)
> +   if (sd) {
> +   struct sched_group *sg = sd->groups;
> +
> +   /*
> +* sched groups hold the flags of the child sched
> +* domain for convenience. Clear such flags since
> +* the child is being destroyed.
> +*/
> +   do {
> +   sg->flags = 0;
> +   } while (sg != sd->groups);
> +
> sd->child = NULL;
> +   }
> }
>
> for (tmp = sd; tmp; tmp = tmp->parent)
> @@ -916,10 +928,12 @@ build_group_from_child_sched_domain(struct sched_domain 
> *sd, int cpu)
> return NULL;
>
> sg_span = sched_group_span(sg);
> -   if (sd->child)
> +   if (sd->child) {
> cpumask_copy(sg_span, sched_domain_span(sd->child));
> -   else
> +   sg->flags = sd->child->flags;
> +   } else {
> cpumask_copy(sg_span, sched_domain_span(sd));
> +   }
>
> atomic_inc(&sg->ref);
> return sg;
> @@ -1169,6 +1183,7 @@ static struct sched_group *get_group(int cpu, struct 
> sd_data *sdd)
> if (child) {
> cpumask_copy(sched_group_span(sg), sched_domain_span(child));
> cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
> +   sg->flags = child->flags;
> } else {
> cpumask_set_cpu(cpu, sched_group_span(sg));
> cpumask_set_cpu(cpu, group_balance_mask(sg));
> --
> 2.17.1
>


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-17 Thread Vincent Guittot
On Fri, 17 Sept 2021 at 03:01, Ricardo Neri
 wrote:
>
> On Wed, Sep 15, 2021 at 05:43:44PM +0200, Vincent Guittot wrote:
> > On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
> >  wrote:
> > >
> > > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > > check for the idle state of the destination CPU, dst_cpu, but also of
> > > its SMT siblings.
> > >
> > > If dst_cpu is idle but its SMT siblings are busy, performance suffers
> > > if it pulls tasks from a medium priority CPU that does not have SMT
> > > siblings.
> > >
> > > Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> > > siblings of both dst_cpu and the CPUs in the candidate busiest group.
> > >
> > > Cc: Aubrey Li 
> > > Cc: Ben Segall 
> > > Cc: Daniel Bristot de Oliveira 
> > > Cc: Dietmar Eggemann 
> > > Cc: Mel Gorman 
> > > Cc: Quentin Perret 
> > > Cc: Rafael J. Wysocki 
> > > Cc: Srinivas Pandruvada 
> > > Cc: Steven Rostedt 
> > > Cc: Tim Chen 
> > > Reviewed-by: Joel Fernandes (Google) 
> > > Reviewed-by: Len Brown 
> > > Signed-off-by: Ricardo Neri 
> > > ---
> > > Changes since v4:
> > >   * Use sg_lb_stats::sum_nr_running to determine the idle state of a scheduling group.
> > > (Vincent, Peter)
> > >   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
> > >   * Updated function documentation and corrected a typo.
> > >
> > > Changes since v3:
> > >   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> > > powerpc folks showed that this patch should not impact them. Also, 
> > > more
> > > recent powerpc processor no longer use asym_packing. (PeterZ)
> > >   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> > >   * Removed unnecessary check for local CPUs when the local group has zero
> > > utilization. (Joel)
> > >   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> > > the fact that it deals with SMT cases.
> > >   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> > > that callers can deal with non-SMT cases.
> > >
> > > Changes since v2:
> > >   * Reworded the commit message to reflect updates in code.
> > >   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> > > balancing. (PeterZ)
> > >   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> > > sched_asym().
> > >
> > > Changes since v1:
> > >   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> > > tasks. Instead, reclassify the candidate busiest group, as it
> > > may still be selected. (PeterZ)
> > >   * Avoid an expensive and unnecessary call to cpumask_weight() when
> > > determining if a sched_group is comprised of SMT siblings.
> > > (PeterZ).
> > > ---
> > >  kernel/sched/fair.c | 94 +
> > >  1 file changed, 94 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 26db017c14a3..8d763dd0174b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int 
> > > imbalance_pct,
> > > return group_has_spare;
> > >  }
> > >
> > > +/**
> > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can 
> > > pull tasks
> > > + * @dst_cpu:   Destination CPU of the load balancing
> > > + * @sds:   Load-balancing data with statistics of the local group
> > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > + * @sg:The candidate busiest group
> > > + *
> > > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > > decide
> > > + * if @dst_cpu can pull tasks.
> > > + *
> > > + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or 
> > > more of
> > > + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, 
> > > pull tasks
> > > + * only if @dst_cpu has higher priority.
> > > + *
> > > + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one 
> > > more
> > > + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher 
> > > priority.
> > > + * Bigger imbalances in the number of busy CPUs will be dealt with in
> > > + * update_sd_pick_busiest().
> > > + *
> > > + * If @sg does not have SMT siblings, only pull tasks if all of the SMT 
> > > siblings
> > > + * of @dst_cpu are idle and @sg has lower priority.
> > > + */
> > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > +   struct sg_lb_stats *sgs,
> > > +   struct sched_group *sg)
> > > +{
> > > +#ifdef CONFIG_SCHED_SMT
> > > +   bool local_is_smt, sg_is_smt;
> > > +   int sg_busy_cpus;
> > > +
> > > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > +
> > > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;

RE: [PATCH v2] powerpc/32: Don't use a struct based type for pte_t

2021-09-17 Thread David Laight
From: Christophe Leroy
> Sent: 17 September 2021 14:58
> 
> Long time ago we had a config item called STRICT_MM_TYPECHECKS
> to build the kernel with pte_t defined as a structure in order
> to perform additional build checks or build it with pte_t
> defined as a simple type in order to get simpler generated code.
> 
...
> diff --git a/arch/powerpc/include/asm/pgtable-types.h 
> b/arch/powerpc/include/asm/pgtable-types.h
> index d11b4c61d686..c60199fc6fa6 100644
> --- a/arch/powerpc/include/asm/pgtable-types.h
> +++ b/arch/powerpc/include/asm/pgtable-types.h
> @@ -5,14 +5,26 @@
>  /* PTE level */
>  #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
>  typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
> -#else
> +#elif defined(__CHECKER__) || !defined(CONFIG_PPC32)
>  typedef struct { pte_basic_t pte; } pte_t;
> +#else
> +typedef pte_basic_t pte_t;
>  #endif
> +
> +#if defined(__CHECKER__) || !defined(CONFIG_PPC32) || \
> +(defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
>  #define __pte(x) ((pte_t) { (x) })
>  static inline pte_basic_t pte_val(pte_t x)
>  {
>   return x.pte;
>  }
> +#else
> +#define __pte(x) ((pte_t)(x))
> +static inline pte_basic_t pte_val(pte_t x)
> +{
> + return x;
> +}
> +#endif

Would it be better to define:
static inline pte_basic_t *pte_basic(pte_t *x)
{
#if xxx
return x;
#else
return &x->pte;
#endif
}

Then pte_val(x) is always *pte_basic(&x)
and the casts like:

> - pte_basic_t *entry = &ptep->pte;
> + pte_basic_t *entry = (pte_basic_t *)ptep;

can go away.
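
With such a helper the two call sites quoted above could then read, roughly
(a sketch only, assuming the pte_basic() helper suggested above):

	pte_basic_t *entry = pte_basic(p);	/* in pte_update() */
	pte_basic_t *entry = pte_basic(ptep);	/* in set_huge_pte_at() */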

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



[PATCH v2] powerpc/32: Don't use a struct based type for pte_t

2021-09-17 Thread Christophe Leroy
Long time ago we had a config item called STRICT_MM_TYPECHECKS
to build the kernel with pte_t defined as a structure in order
to perform additional build checks or build it with pte_t
defined as a simple type in order to get simpler generated code.

Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
made the struct based definition the only one, considering that the
generated code was similar in both cases.

That's right on ppc64 because the ABI is such that the content of a
struct having a single simple type element is passed as register,
but on ppc32 such a structure is passed via the stack like any
structure.

Simple test function:

pte_t test(pte_t pte)
{
return pte;
}

Before this patch we get

c00108ec :
c00108ec:   81 24 00 00 lwz r9,0(r4)
c00108f0:   91 23 00 00 stw r9,0(r3)
c00108f4:   4e 80 00 20 blr

So, for PPC32, restore the simple type behaviour we got before
commit 670eea924198, but instead of adding a config option to
activate type check, do it when __CHECKER__ is set so that type
checking is performed by 'sparse' and provides feedback like:

arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return 
expression (different base types)
arch/powerpc/mm/pgtable.c:466:16:expected unsigned long
arch/powerpc/mm/pgtable.c:466:16:got struct pte_t [usertype] x

With this patch we now get

c0010890 :
c0010890:   4e 80 00 20 blr

Signed-off-by: Christophe Leroy 
---
v2: Properly handle 8xx 16k pages
---
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 +-
 arch/powerpc/include/asm/pgtable-types.h | 14 +-
 arch/powerpc/mm/pgtable.c|  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index f06ae00f2a65..34ce50da1850 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -245,7 +245,7 @@ static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t 
val, int huge)
 static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, 
pte_t *p,
 unsigned long clr, unsigned long set, int 
huge)
 {
-   pte_basic_t *entry = &p->pte;
+   pte_basic_t *entry = (pte_basic_t *)p;
pte_basic_t old = pte_val(*p);
pte_basic_t new = (old & ~(pte_basic_t)clr) | set;
int num, i;
diff --git a/arch/powerpc/include/asm/pgtable-types.h 
b/arch/powerpc/include/asm/pgtable-types.h
index d11b4c61d686..c60199fc6fa6 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -5,14 +5,26 @@
 /* PTE level */
 #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
 typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
-#else
+#elif defined(__CHECKER__) || !defined(CONFIG_PPC32)
 typedef struct { pte_basic_t pte; } pte_t;
+#else
+typedef pte_basic_t pte_t;
 #endif
+
+#if defined(__CHECKER__) || !defined(CONFIG_PPC32) || \
+(defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
 #define __pte(x)   ((pte_t) { (x) })
 static inline pte_basic_t pte_val(pte_t x)
 {
return x.pte;
 }
+#else
+#define __pte(x)   ((pte_t)(x))
+static inline pte_basic_t pte_val(pte_t x)
+{
+   return x;
+}
+#endif
 
 /* PMD level */
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cd16b407f47e..ce9482383144 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -271,7 +271,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep, pte_
 {
pmd_t *pmd = pmd_off(mm, addr);
pte_basic_t val;
-   pte_basic_t *entry = &ptep->pte;
+   pte_basic_t *entry = (pte_basic_t *)ptep;
int num, i;
 
/*
-- 
2.31.1



[PATCH v2] powerpc/8xx: Simplify TLB handling

2021-09-17 Thread Christophe Leroy
In the old days, TLB handling for 8xx was using tlbie and tlbia
instructions directly as much as possible.

But commit f048aace29e0 ("powerpc/mm: Add SMP support to no-hash
TLB handling") broke that by introducing out-of-line unnecessary
complex functions for booke/smp which don't have tlbie/tlbia
instructions and require more complex handling.

Restore direct use of tlbie and tlbia for 8xx which is never SMP.

With this patch we now get

c00ecc68 :
c00ecc68:   39 00 00 00 li  r8,0
c00ecc6c:   81 46 00 00 lwz r10,0(r6)
c00ecc70:   91 06 00 00 stw r8,0(r6)
c00ecc74:   7c 00 2a 64 tlbie   r5,r0
c00ecc78:   7c 00 04 ac hwsync
c00ecc7c:   91 43 00 00 stw r10,0(r3)
c00ecc80:   4e 80 00 20 blr

Before it was

c0012880 :
c0012880:   2c 03 00 00 cmpwi   r3,0
c0012884:   41 82 00 54 beq c00128d8 

c0012888:   81 22 00 00 lwz r9,0(r2)
c001288c:   81 43 00 20 lwz r10,32(r3)
c0012890:   39 29 00 01 addir9,r9,1
c0012894:   91 22 00 00 stw r9,0(r2)
c0012898:   2c 0a 00 00 cmpwi   r10,0
c001289c:   41 82 00 10 beq c00128ac 

c00128a0:   81 2a 01 dc lwz r9,476(r10)
c00128a4:   2c 09 ff ff cmpwi   r9,-1
c00128a8:   41 82 00 0c beq c00128b4 

c00128ac:   7c 00 22 64 tlbie   r4,r0
c00128b0:   7c 00 04 ac hwsync
c00128b4:   81 22 00 00 lwz r9,0(r2)
c00128b8:   39 29 ff ff addir9,r9,-1
c00128bc:   2c 09 00 00 cmpwi   r9,0
c00128c0:   91 22 00 00 stw r9,0(r2)
c00128c4:   4c a2 00 20 bclr+   4,eq
c00128c8:   81 22 00 70 lwz r9,112(r2)
c00128cc:   71 29 00 04 andi.   r9,r9,4
c00128d0:   4d 82 00 20 beqlr
c00128d4:   48 65 76 74 b   c0669f48 
c00128d8:   81 22 00 00 lwz r9,0(r2)
c00128dc:   39 29 00 01 addir9,r9,1
c00128e0:   91 22 00 00 stw r9,0(r2)
c00128e4:   4b ff ff c8 b   c00128ac 

...
c00ecdc8 :
c00ecdc8:   94 21 ff f0 stwur1,-16(r1)
c00ecdcc:   39 20 00 00 li  r9,0
c00ecdd0:   93 c1 00 08 stw r30,8(r1)
c00ecdd4:   83 c6 00 00 lwz r30,0(r6)
c00ecdd8:   91 26 00 00 stw r9,0(r6)
c00ecddc:   93 e1 00 0c stw r31,12(r1)
c00ecde0:   7c 08 02 a6 mflrr0
c00ecde4:   7c 7f 1b 78 mr  r31,r3
c00ecde8:   7c 83 23 78 mr  r3,r4
c00ecdec:   7c a4 2b 78 mr  r4,r5
c00ecdf0:   90 01 00 14 stw r0,20(r1)
c00ecdf4:   4b f2 5a 8d bl  c0012880 
c00ecdf8:   93 df 00 00 stw r30,0(r31)
c00ecdfc:   7f e3 fb 78 mr  r3,r31
c00ece00:   80 01 00 14 lwz r0,20(r1)
c00ece04:   83 c1 00 08 lwz r30,8(r1)
c00ece08:   83 e1 00 0c lwz r31,12(r1)
c00ece0c:   7c 08 03 a6 mtlrr0
c00ece10:   38 21 00 10 addir1,r1,16
c00ece14:   4e 80 00 20 blr

Signed-off-by: Christophe Leroy 
---
v2: Remove tracing of tlbie and tlbia as it generates circular dependency 
problems. Will see later if it can be added back. Not a big issue though.
---
 arch/powerpc/include/asm/nohash/tlbflush.h | 15 +++
 arch/powerpc/mm/nohash/tlb.c   |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h 
b/arch/powerpc/include/asm/nohash/tlbflush.h
index 1edb7243e515..c08d25e3e626 100644
--- a/arch/powerpc/include/asm/nohash/tlbflush.h
+++ b/arch/powerpc/include/asm/nohash/tlbflush.h
@@ -32,11 +32,26 @@ extern void flush_tlb_range(struct vm_area_struct *vma, 
unsigned long start,
unsigned long end);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
+#ifdef CONFIG_PPC_8xx
+static inline void local_flush_tlb_mm(struct mm_struct *mm)
+{
+   unsigned int pid = READ_ONCE(mm->context.id);
+
+   if (pid != MMU_NO_CONTEXT)
+   asm volatile ("sync; tlbia; isync" : : : "memory");
+}
+
+static inline void local_flush_tlb_page(struct vm_area_struct *vma, unsigned 
long vmaddr)
+{
+   asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
+}
+#else
 extern void local_flush_tlb_mm(struct mm_struct *mm);
 extern void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long 
vmaddr);
 
 extern void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
   int tsize, int ind);
+#endif
 
 #ifdef 

Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Jan Beulich
On 17.09.2021 11:36, Roman Skakun wrote:
> I use Xen PV display. In my case, PV display backend(Dom0) allocates
> contiguous buffer via DMA-API
> to implement zero-copy between Dom0 and DomU.

Why does the buffer need to be allocated by Dom0? If it was allocated
by DomU, it could use grants to give the backend direct (zero-copy)
access.

Jan



Re: [PATCH] powerpc: warn on emulation of dcbz instruction

2021-09-17 Thread Benjamin Herrenschmidt
On Thu, 2021-09-16 at 14:36 +, David Laight wrote:
> > Does userspace accesses non-cached memory directly ?
> 
> 
> It probably can if a driver mmaps PCI space directly into user space.
> 
> That certainly works on x86-64.

The posterchild for that is Xorg

Cheers,
Ben.




[PATCH] powerpc/32: Don't use a struct based type for pte_t

2021-09-17 Thread Christophe Leroy
Long time ago we had a config item called STRICT_MM_TYPECHECKS
to build the kernel with pte_t defined as a structure in order
to perform additional build checks or build it with pte_t
defined as a simple type in order to get simpler generated code.

Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
made the struct based definition the only one, considering that the
generated code was similar in both cases.

That's right on ppc64 because the ABI is such that the content of a
struct having a single simple type element is passed as register,
but on ppc32 such a structure is passed via the stack like any
structure.

Simple test function:

pte_t test(pte_t pte)
{
return pte;
}

Before this patch we get

c00108ec :
c00108ec:   81 24 00 00 lwz r9,0(r4)
c00108f0:   91 23 00 00 stw r9,0(r3)
c00108f4:   4e 80 00 20 blr

So, for PPC32, restore the simple type behaviour we got before
commit 670eea924198, but instead of adding a config option to
activate type check, do it when __CHECKER__ is set so that type
checking is performed by 'sparse' and provides feedback like:

arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return 
expression (different base types)
arch/powerpc/mm/pgtable.c:466:16:expected unsigned long
arch/powerpc/mm/pgtable.c:466:16:got struct pte_t [usertype] x

With this patch we now get

c0010890 :
c0010890:   4e 80 00 20 blr

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 +-
 arch/powerpc/include/asm/pgtable-types.h | 13 -
 arch/powerpc/mm/pgtable.c|  2 +-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index f06ae00f2a65..34ce50da1850 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -245,7 +245,7 @@ static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t 
val, int huge)
 static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, 
pte_t *p,
 unsigned long clr, unsigned long set, int 
huge)
 {
-   pte_basic_t *entry = &p->pte;
+   pte_basic_t *entry = (pte_basic_t *)p;
pte_basic_t old = pte_val(*p);
pte_basic_t new = (old & ~(pte_basic_t)clr) | set;
int num, i;
diff --git a/arch/powerpc/include/asm/pgtable-types.h 
b/arch/powerpc/include/asm/pgtable-types.h
index d11b4c61d686..e5ed4580a62c 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -5,14 +5,25 @@
 /* PTE level */
 #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
 typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
-#else
+#elif defined(__CHECKER__) || !defined(CONFIG_PPC32)
 typedef struct { pte_basic_t pte; } pte_t;
+#else
+typedef pte_basic_t pte_t;
 #endif
+
+#if defined(__CHECKER__) || !defined(CONFIG_PPC32)
 #define __pte(x)   ((pte_t) { (x) })
 static inline pte_basic_t pte_val(pte_t x)
 {
return x.pte;
 }
+#else
+#define __pte(x)   ((pte_t)(x))
+static inline pte_basic_t pte_val(pte_t x)
+{
+   return x;
+}
+#endif
 
 /* PMD level */
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cd16b407f47e..ce9482383144 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -271,7 +271,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep, pte_
 {
pmd_t *pmd = pmd_off(mm, addr);
pte_basic_t val;
-   pte_basic_t *entry = &ptep->pte;
+   pte_basic_t *entry = (pte_basic_t *)ptep;
int num, i;
 
/*
-- 
2.31.1



Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Roman Skakun
Hi Jan,

>>>  In order to be sure to catch all uses like this one (including ones
>>>  which make it upstream in parallel to yours), I think you will want
>>>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
>>
>> I don't understand your point. Can you clarify this?
>
> There's a concrete present example: I have a patch pending adding
> another use of IO_TLB_SEGSIZE. This use would need to be replaced
> like you do here in several places. The need for the additional
> replacement would be quite obvious (from a build failure) if you
> renamed the manifest constant. Without renaming, it'll take
> someone running into an issue on a live system, which I consider
> far worse. This is because a simple re-basing of one of the
> patches on top of the other will not point out the need for the
> extra replacement, nor would a test build (with both patches in
> place).

It's reasonable.
I will change the original IO_TLB_SEGSIZE to IO_TLB_DEFAULT_SEGSIZE in the
next patch series.

Thanks.

Wed, 15 Sept 2021 at 16:50, Jan Beulich :
>
> On 15.09.2021 15:37, Roman Skakun wrote:
> >>> From: Roman Skakun 
> >>>
> >>> It is possible when default IO TLB size is not
> >>> enough to fit a long buffers as described here [1].
> >>>
> >>> This patch makes a way to set this parameter
> >>> using cmdline instead of recompiling a kernel.
> >>>
> >>> [1] https://www.xilinx.com/support/answers/72694.html
> >>
> >>  I'm not convinced the swiotlb use describe there falls under "intended
> >>  use" - mapping a 1280x720 framebuffer in a single chunk?
> >
> > I had the same issue while mapping DMA chuck ~4MB for gem fb when
> > using xen vdispl.
> > I got the next log:
> > [ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> > 3686400 bytes), total 32768 (slots), used 32 (slots)
> >
> > It happened when I tried to map bounce buffer, which has a large size.
> > The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
> > bytes, but we requested 3686400 bytes.
> > When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
> > 2048(IO_TLB_SHIFT) = 4194304bytes).
> > It makes possible to retrieve a bounce buffer for requested size.
> > After changing this value, the problem is gone.
>
> But the question remains: Why does the framebuffer need to be mapped
> in a single giant chunk?
>
> >>  In order to be sure to catch all uses like this one (including ones
> >>  which make it upstream in parallel to yours), I think you will want
> >>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
> >
> > I don't understand your point. Can you clarify this?
>
> There's a concrete present example: I have a patch pending adding
> another use of IO_TLB_SEGSIZE. This use would need to be replaced
> like you do here in several places. The need for the additional
> replacement would be quite obvious (from a build failure) if you
> renamed the manifest constant. Without renaming, it'll take
> someone running into an issue on a live system, which I consider
> far worse. This is because a simple re-basing of one of the
> patches on top of the other will not point out the need for the
> extra replacement, nor would a test build (with both patches in
> place).
>
> Jan
>


-- 
Best Regards, Roman.


[powerpc:merge] BUILD SUCCESS a0175bcd45a4dd0de301dca8fd17d74fcb3ce240

2021-09-17 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: a0175bcd45a4dd0de301dca8fd17d74fcb3ce240  powerpc/ci: Switch GCC 
4.9.4 builds to 5.5.0

elapsed time: 1069m

configs tested: 105
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64            allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-20210916
sh  rsk7269_defconfig
powerpc sequoia_defconfig
mips   bmips_be_defconfig
riscv allnoconfig
powerpc mpc85xx_cds_defconfig
s390   zfcpdump_defconfig
powerpc  arches_defconfig
m68k   m5275evb_defconfig
powerpc powernv_defconfig
mips  rm200_defconfig
mips cu1000-neo_defconfig
arm  ep93xx_defconfig
xtensa  audio_kc705_defconfig
xtensa  iss_defconfig
powerpc linkstation_defconfig
sh               ecovec24-romimage_defconfig
mips  malta_defconfig
arm am200epdkit_defconfig
x86_64   randconfig-c001-20210916
arm  randconfig-c002-20210916
ia64 allmodconfig
ia64             defconfig
ia64 allyesconfig
m68k allmodconfig
m68k             defconfig
m68k allyesconfig
nios2   defconfig
nds32 allnoconfig
nds32   defconfig
nios2            allyesconfig
csky             defconfig
alpha   defconfig
alpha            allyesconfig
xtensa   allyesconfig
h8300            allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390             defconfig
i386 allyesconfig
sparc            allyesconfig
sparc   defconfig
i386             defconfig
arc  allyesconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
x86_64   randconfig-a016-20210916
x86_64   randconfig-a013-20210916
x86_64   randconfig-a012-20210916
x86_64   randconfig-a011-20210916
x86_64   randconfig-a014-20210916
x86_64   randconfig-a015-20210916
i386 randconfig-a016-20210916
i386 randconfig-a015-20210916
i386 randconfig-a011-20210916
i386 randconfig-a012-20210916
i386 randconfig-a013-20210916
i386 randconfig-a014-20210916
riscv            randconfig-r042-20210916
s390 randconfig-r044-20210916
arc  randconfig-r043-20210916
riscv            nommu_k210_defconfig
riscv            allyesconfig
riscv            nommu_virt_defconfig
riscv   defconfig
riscv  rv32_defconfig
riscv            allmodconfig
x86_64           rhel-8.3-kselftests
um   x86_64_defconfig
um i386_defconfig
x86_64   allyesconfig
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
riscv            randconfig-c006-20210916
x86_64   randconfig-c007-20210916
mips randconfig-c004-20210916
powerpc  randconfig-c003-20210916
arm  randconfig-c002-20210916
i386 randconfig-c001-20210916
s390 randconfig-c005-20210916
i386 randconfig-a004-20210916
i386 randconfig-a005-20210916
i386 randconfig-a006-20210916
i386 randconfig-a002-20210916
i386 randconfig-a003-20210916
i386 randconfig-a001-20210916
x86_64   randconfig-a002-20210916
x86_64 

Re: [PATCH] powerpc/mce: check if event info is valid

2021-09-17 Thread Ganesh

On 8/6/21 6:53 PM, Ganesh Goudar wrote:


Check if the event info is valid before printing the
event information. When a fwnmi enabled nested kvm guest
hits a machine check exception, L0 and L2 would generate
machine check event info, but L1 would not generate any
machine check event info as it won't go through the 0x200
vector, and so it prints an unwanted message.

To fix this, rename the 'in_use' variable in the machine check event
info, which is no longer used, to 'valid', and check if the event
information is valid before logging it.
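
A minimal sketch of the intended check (placement and exact form are
illustrative only; the actual patch may differ):

	/* at the top of the event logging path, e.g. machine_check_print_event_info() */
	if (!evt->valid)
		return;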

Without this patch, L1 would print the following message for
exceptions encountered in L2, as the event structure will be
empty in L1.

"Machine Check Exception, Unknown event version 0".

Signed-off-by: Ganesh Goudar 
---


Hi mpe, any comments on this patch?



Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Robin Murphy

On 2021-09-17 10:36, Roman Skakun wrote:

Hi, Christoph

I use Xen PV display. In my case, PV display backend(Dom0) allocates
contiguous buffer via DMA-API
to implement zero-copy between Dom0 and DomU.


Well, something's gone badly wrong there - if you have to shadow the 
entire thing in a bounce buffer to import it then it's hardly zero-copy, 
is it? If you want to do buffer sharing the buffer really needs to be 
allocated appropriately to begin with, such that all relevant devices 
can access it directly. That might be something which needs fixing in Xen.


Robin.


When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666] dump_backtrace+0x0/0x1b0
[ 112.581373] show_stack+0x18/0x70
[ 112.584746] dump_stack+0xd0/0x12c
[ 112.588200] swiotlb_tbl_map_single+0x234/0x360
[ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249] dma_map_sg_attrs+0x54/0x60
[ 112.605138] vsp1_du_map_sg+0x30/0x60
[ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362] drm_atomic_helper_commit+0x88/0x390
[ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532] drm_ioctl_kernel+0xc4/0x11c
[ 112.641506] drm_ioctl+0x21c/0x460
[ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775] do_el0_svc+0x24/0x90
[ 112.657148] el0_svc+0x14/0x20
[ 112.660254] el0_sync_handler+0x1a4/0x1b0
[ 112.664315] el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 65536 (slots), used 112 (slots)
```
The problem is happened here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
includes a single page chunk
as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
Internally, vsp1_du_map_sg() using ops->map_sg (e.g
xen_swiotlb_map_sg) to perform
mapping.

I realized that required segment is too big to be fitted to default
swiotlb segment and condition
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

Wed, 15 Sept 2021 at 16:53, Christoph Hellwig :


On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?


More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.






Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Roman Skakun
Hi, Christoph

I use Xen PV display. In my case, PV display backend(Dom0) allocates
contiguous buffer via DMA-API
to implement zero-copy between Dom0 and DomU.

When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666] dump_backtrace+0x0/0x1b0
[ 112.581373] show_stack+0x18/0x70
[ 112.584746] dump_stack+0xd0/0x12c
[ 112.588200] swiotlb_tbl_map_single+0x234/0x360
[ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249] dma_map_sg_attrs+0x54/0x60
[ 112.605138] vsp1_du_map_sg+0x30/0x60
[ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362] drm_atomic_helper_commit+0x88/0x390
[ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532] drm_ioctl_kernel+0xc4/0x11c
[ 112.641506] drm_ioctl+0x21c/0x460
[ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775] do_el0_svc+0x24/0x90
[ 112.657148] el0_svc+0x14/0x20
[ 112.660254] el0_sync_handler+0x1a4/0x1b0
[ 112.664315] el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 65536 (slots), used 112 (slots)
```
The problem is happened here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
includes a single page chunk
as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
Internally, vsp1_du_map_sg() using ops->map_sg (e.g
xen_swiotlb_map_sg) to perform
mapping.

I realized that the required segment is too big to fit into the default
swiotlb segment, so the condition at
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.
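
Rough arithmetic, assuming the stock definitions (IO_TLB_SEGSIZE = 128 slots,
slot size 1 << IO_TLB_SHIFT = 2048 bytes):

	IO_TLB_SEGSIZE * (1 << IO_TLB_SHIFT) = 128 * 2048  = 262144 bytes (256 KiB)
	requested mapping                    = 3686400 bytes (~3.5 MiB)
	with IO_TLB_SEGSIZE = 2048:            2048 * 2048 = 4194304 bytes (4 MiB), which fits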

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

Wed, 15 Sept 2021 at 16:53, Christoph Hellwig :
>
> On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> > But the question remains: Why does the framebuffer need to be mapped
> > in a single giant chunk?
>
> More importantly: if you use dynamic dma mappings for your framebuffer
> you're doing something wrong.



-- 
Best Regards, Roman.


[PATCH] powerpc/8xx: Simplify TLB handling

2021-09-17 Thread Christophe Leroy
In the old days, TLB handling for 8xx was using tlbie and tlbia
instructions directly as much as possible.

But commit f048aace29e0 ("powerpc/mm: Add SMP support to no-hash
TLB handling") broke that by introducing out-of-line unnecessary
complex functions for booke/smp which don't have tlbie/tlbia
instructions and require more complex handling.

Restore direct use of tlbie and tlbia for 8xx which is never SMP.

With this patch we now get

c00ecc68 :
c00ecc68:   39 00 00 00 li  r8,0
c00ecc6c:   81 46 00 00 lwz r10,0(r6)
c00ecc70:   91 06 00 00 stw r8,0(r6)
c00ecc74:   7c 00 2a 64 tlbie   r5,r0
c00ecc78:   7c 00 04 ac hwsync
c00ecc7c:   91 43 00 00 stw r10,0(r3)
c00ecc80:   4e 80 00 20 blr

Before it was

c0012880 :
c0012880:   2c 03 00 00 cmpwi   r3,0
c0012884:   41 82 00 54 beq c00128d8 

c0012888:   81 22 00 00 lwz r9,0(r2)
c001288c:   81 43 00 20 lwz r10,32(r3)
c0012890:   39 29 00 01 addir9,r9,1
c0012894:   91 22 00 00 stw r9,0(r2)
c0012898:   2c 0a 00 00 cmpwi   r10,0
c001289c:   41 82 00 10 beq c00128ac 

c00128a0:   81 2a 01 dc lwz r9,476(r10)
c00128a4:   2c 09 ff ff cmpwi   r9,-1
c00128a8:   41 82 00 0c beq c00128b4 

c00128ac:   7c 00 22 64 tlbie   r4,r0
c00128b0:   7c 00 04 ac hwsync
c00128b4:   81 22 00 00 lwz r9,0(r2)
c00128b8:   39 29 ff ff addir9,r9,-1
c00128bc:   2c 09 00 00 cmpwi   r9,0
c00128c0:   91 22 00 00 stw r9,0(r2)
c00128c4:   4c a2 00 20 bclr+   4,eq
c00128c8:   81 22 00 70 lwz r9,112(r2)
c00128cc:   71 29 00 04 andi.   r9,r9,4
c00128d0:   4d 82 00 20 beqlr
c00128d4:   48 65 76 74 b   c0669f48 
c00128d8:   81 22 00 00 lwz r9,0(r2)
c00128dc:   39 29 00 01 addir9,r9,1
c00128e0:   91 22 00 00 stw r9,0(r2)
c00128e4:   4b ff ff c8 b   c00128ac 

...
c00ecdc8 :
c00ecdc8:   94 21 ff f0 stwur1,-16(r1)
c00ecdcc:   39 20 00 00 li  r9,0
c00ecdd0:   93 c1 00 08 stw r30,8(r1)
c00ecdd4:   83 c6 00 00 lwz r30,0(r6)
c00ecdd8:   91 26 00 00 stw r9,0(r6)
c00ecddc:   93 e1 00 0c stw r31,12(r1)
c00ecde0:   7c 08 02 a6 mflrr0
c00ecde4:   7c 7f 1b 78 mr  r31,r3
c00ecde8:   7c 83 23 78 mr  r3,r4
c00ecdec:   7c a4 2b 78 mr  r4,r5
c00ecdf0:   90 01 00 14 stw r0,20(r1)
c00ecdf4:   4b f2 5a 8d bl  c0012880 
c00ecdf8:   93 df 00 00 stw r30,0(r31)
c00ecdfc:   7f e3 fb 78 mr  r3,r31
c00ece00:   80 01 00 14 lwz r0,20(r1)
c00ece04:   83 c1 00 08 lwz r30,8(r1)
c00ece08:   83 e1 00 0c lwz r31,12(r1)
c00ece0c:   7c 08 03 a6 mtlrr0
c00ece10:   38 21 00 10 addir1,r1,16
c00ece14:   4e 80 00 20 blr

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/tlbflush.h | 20 
 arch/powerpc/mm/nohash/tlb.c   |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h 
b/arch/powerpc/include/asm/nohash/tlbflush.h
index 1edb7243e515..a7ec171d79a6 100644
--- a/arch/powerpc/include/asm/nohash/tlbflush.h
+++ b/arch/powerpc/include/asm/nohash/tlbflush.h
@@ -32,11 +32,31 @@ extern void flush_tlb_range(struct vm_area_struct *vma, 
unsigned long start,
unsigned long end);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
+#ifdef CONFIG_PPC_8xx
+#include 
+
+static inline void local_flush_tlb_mm(struct mm_struct *mm)
+{
+   unsigned int pid = READ_ONCE(mm->context.id);
+
+   if (pid != MMU_NO_CONTEXT) {
+   asm volatile ("sync; tlbia; isync" : : : "memory");
+   trace_tlbia(pid);
+   }
+}
+
+static inline void local_flush_tlb_page(struct vm_area_struct *vma, unsigned 
long vmaddr)
+{
+   asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
+   trace_tlbie(0, 0, vmaddr, vma ? vma->vm_mm->context.id : 0, 0, 0, 0);
+}
+#else
 extern void local_flush_tlb_mm(struct mm_struct *mm);
 extern void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long 
vmaddr);
 
 extern void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
   int tsize, int ind);
+#endif
 
 #ifdef CONFIG_SMP
 extern 

Re: [PATCH v2] powerpc/mce: Fix access error in mce handler

2021-09-17 Thread Ganesh

On 9/17/21 12:09 PM, Daniel Axtens wrote:


Hi Ganesh,


We queue an irq work for deferred processing of mce event
in realmode mce handler, where translation is disabled.
Queuing of the work may result in accessing memory outside
RMO region, such access needs the translation to be enabled
for an LPAR running with hash mmu else the kernel crashes.

After enabling translation in mce_handle_error() we used to
leave it enabled to avoid crashing here, but now with the
commit 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before
returning from handler") we are restoring the MSR to disable
translation.

Hence to fix this enable the translation before queuing the work.

[snip]


Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from 
handler")

That patch changes arch/powerpc/powerpc/platforms/pseries/ras.c just
below this comment:

 /*
  * Enable translation as we will be accessing per-cpu variables
  * in save_mce_event() which may fall outside RMO region, also
  * leave it enabled because subsequently we will be queuing work
  * to workqueues where again per-cpu variables accessed, besides
  * fwnmi_release_errinfo() crashes when called in realmode on
  * pseries.
  * Note: All the realmode handling like flushing SLB entries for
  *   SLB multihit is done by now.
  */

That suggests per-cpu variables need protection. In your patch, you
enable translations just around irq_work_queue:


The comment is a bit old and most of it no longer applies. Per-cpu
variables still cannot be accessed in realmode, but with commit 923b3cf00b3f
("powerpc/mce: Remove per cpu variables from MCE handlers") we moved all of
them to the paca.


+   /* Queue irq work to process this event later. Before
+* queuing the work enable translation for non radix LPAR,
+* as irq_work_queue may try to access memory outside RMO
+* region.
+*/
+   if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
+   msr = mfmsr();
+   mtmsr(msr | MSR_IR | MSR_DR);
+   irq_work_queue(&mce_event_process_work);
+   mtmsr(msr);
+   } else {
+   irq_work_queue(&mce_event_process_work);
+   }

However, just before that in the function, there are a few things that
access per-cpu variables via the local_paca, e.g.:

memcpy(&local_paca->mce_info->mce_event_queue[index],
   &evt, sizeof(evt));

Do we need to widen the window where translations are enabled in order
to protect accesses to local_paca?


The paca is within the Real Mode Area, so it can be accessed with translation off.



Re: [PATCH v1 1/2] powerpc/64s: system call rfscv workaround for TM bugs

2021-09-17 Thread Daniel Axtens
Nicholas Piggin  writes:

> The rfscv instruction does not work correctly with the fake-suspend mode
> in POWER9, which can end up with the hypervisor restoring an incorrect
> checkpoint.

If I understand correctly from commit 4bb3c7a0208f ("KVM: PPC: Book3S
HV: Work around transactional memory bugs in POWER9"), this is because
rfscv does not cause a soft-patch interrupt in the way that rfid etc do.
So we need to avoid calling rfscv if we are in fake-suspend state -
instead we must call something that does indeed get soft-patched - like
rfid.

> Work around this by setting the _TIF_RESTOREALL flag if a system call
> returns to a transaction active state, causing rfid to be used instead
> of rfscv to return, which will do the right thing. The contents of the
> registers are irrelevant because they will be overwritten in this case
> anyway.

I can follow that this will indeed cause syscall_exit_prepare to return
non-zero and therefore we should take the
syscall_vectored_*_restore_regs path which does an RFID_TO_USER rather
than a RFSCV_TO_USER. My only question/concern is:

.Lsyscall_vectored_\name\()_exit:
addir4,r1,STACK_FRAME_OVERHEAD
li  r5,1 /* scv */
bl  syscall_exit_prepare   <-- we get r3 != 0 here
std r1,PACA_EXIT_SAVE_R1(r13) /* save r1 for restart */
.Lsyscall_vectored_\name\()_rst_start:
lbz r11,PACAIRQHAPPENED(r13)
andi.   r11,r11,(~PACA_IRQ_HARD_DIS)@l
bne-syscall_vectored_\name\()_restart <-- can we end up taking
  this branch?

Are there any circumstances that would take us down the _restart path,
and if so, will we still go through the correct RFID_TO_USER branch
rather than the RFSCV_TO_USER branch?

Apart from that this looks good to me, although with the heavy
disclaimer that I only learned about fake suspend for the first time
while reviewing the patch.

Kind regards,
Daniel

>
> Reported-by: Eirik Fuller 
> Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv 
> instructions")
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/interrupt.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index c77c80214ad3..917a2ac4def6 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -139,6 +139,19 @@ notrace long system_call_exception(long r3, long r4, 
> long r5,
>*/
>   irq_soft_mask_regs_set_state(regs, IRQS_ENABLED);
>  
> + /*
> +  * If system call is called with TM active, set _TIF_RESTOREALL to
> +  * prevent RFSCV being used to return to userspace, because POWER9
> +  * TM implementation has problems with this instruction returning to
> +  * transactional state. Final register values are not relevant because
> +  * the transaction will be aborted upon return anyway. Or in the case
> +  * of unsupported_scv SIGILL fault, the return state does not much
> +  * matter because it's an edge case.
> +  */
> + if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
> + unlikely(MSR_TM_TRANSACTIONAL(regs->msr)))
> + current_thread_info()->flags |= _TIF_RESTOREALL;
> +
>   /*
>* If the system call was made with a transaction active, doom it and
>* return without performing the system call. Unless it was an
> -- 
> 2.23.0


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-17 Thread Vincent Guittot
On Fri, 17 Sept 2021 at 03:01, Ricardo Neri wrote:
>
> On Wed, Sep 15, 2021 at 05:43:44PM +0200, Vincent Guittot wrote:
> > On Sat, 11 Sept 2021 at 03:19, Ricardo Neri wrote:
> > >
> > > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > > check for the idle state of the destination CPU, dst_cpu, but also of
> > > its SMT siblings.
> > >
> > > If dst_cpu is idle but its SMT siblings are busy, performance suffers
> > > if it pulls tasks from a medium priority CPU that does not have SMT
> > > siblings.
> > >
> > > Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> > > siblings of both dst_cpu and the CPUs in the candidate busiest group.
> > >
> > > Cc: Aubrey Li 
> > > Cc: Ben Segall 
> > > Cc: Daniel Bristot de Oliveira 
> > > Cc: Dietmar Eggemann 
> > > Cc: Mel Gorman 
> > > Cc: Quentin Perret 
> > > Cc: Rafael J. Wysocki 
> > > Cc: Srinivas Pandruvada 
> > > Cc: Steven Rostedt 
> > > Cc: Tim Chen 
> > > Reviewed-by: Joel Fernandes (Google) 
> > > Reviewed-by: Len Brown 
> > > Signed-off-by: Ricardo Neri 
> > > ---
> > > Changes since v4:
> > >   * Use sg_lb_stats::sum_nr_running the idle state of a scheduling group.
> > > (Vincent, Peter)
> > >   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
> > >   * Updated function documentation and corrected a typo.
> > >
> > > Changes since v3:
> > >   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> > > powerpc folks showed that this patch should not impact them. Also, 
> > > more
> > > recent powerpc processor no longer use asym_packing. (PeterZ)
> > >   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> > >   * Removed unnecessary check for local CPUs when the local group has zero
> > > utilization. (Joel)
> > >   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> > > the fact that it deals with SMT cases.
> > >   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> > > that callers can deal with non-SMT cases.
> > >
> > > Changes since v2:
> > >   * Reworded the commit message to reflect updates in code.
> > >   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> > > balancing. (PeterZ)
> > >   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> > > sched_asym().
> > >
> > > Changes since v1:
> > >   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> > > tasks. Instead, reclassify the candidate busiest group, as it
> > > may still be selected. (PeterZ)
> > >   * Avoid an expensive and unnecessary call to cpumask_weight() when
> > > determining if a sched_group is comprised of SMT siblings.
> > > (PeterZ).
> > > ---
> > >  kernel/sched/fair.c | 94 +
> > >  1 file changed, 94 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 26db017c14a3..8d763dd0174b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int 
> > > imbalance_pct,
> > > return group_has_spare;
> > >  }
> > >
> > > +/**
> > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can 
> > > pull tasks
> > > + * @dst_cpu:   Destination CPU of the load balancing
> > > + * @sds:   Load-balancing data with statistics of the local group
> > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > + * @sg:The candidate busiest group
> > > + *
> > > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > > decide
> > > + * if @dst_cpu can pull tasks.
> > > + *
> > > + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or 
> > > more of
> > > + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, 
> > > pull tasks
> > > + * only if @dst_cpu has higher priority.
> > > + *
> > > + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one 
> > > more
> > > + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher 
> > > priority.
> > > + * Bigger imbalances in the number of busy CPUs will be dealt with in
> > > + * update_sd_pick_busiest().
> > > + *
> > > + * If @sg does not have SMT siblings, only pull tasks if all of the SMT 
> > > siblings
> > > + * of @dst_cpu are idle and @sg has lower priority.
> > > + */
> > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > +   struct sg_lb_stats *sgs,
> > > +   struct sched_group *sg)
> > > +{
> > > +#ifdef CONFIG_SCHED_SMT
> > > +   bool local_is_smt, sg_is_smt;
> > > +   int sg_busy_cpus;
> > > +
> > > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > +
> > > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
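
(The quoted hunk is cut short here by the archive. As an editorial aside, the rules
documented in the comment above boil down to roughly the sketch below. This is an
illustration only, not the body of the patch: it reuses the field names visible in the
quoted lines and omits the CONFIG_SCHED_SMT guard.)

static bool asym_smt_can_pull_tasks_sketch(int dst_cpu, struct sd_lb_stats *sds,
                                           struct sg_lb_stats *sgs,
                                           struct sched_group *sg)
{
        bool local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
        bool sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
        int sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;

        if (!local_is_smt) {
                /* dst_cpu has no siblings: pull if two or more CPUs of @sg
                 * are busy, otherwise only from a lower priority CPU. */
                if (sg_busy_cpus >= 2)
                        return true;
                return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
        }

        if (sg_is_smt) {
                /* Both have siblings: act on an imbalance of exactly one
                 * busy CPU, and only if dst_cpu has higher priority. */
                int local_busy_cpus = sds->local->group_weight -
                                      sds->local_stat.idle_cpus;

                if (sg_busy_cpus - local_busy_cpus == 1)
                        return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
                return false;
        }

        /* @sg has no siblings: pull only if all siblings of dst_cpu are
         * idle and @sg has lower priority. */
        if (!sds->local_stat.sum_nr_running)
                return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);

        return false;
}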

[PATCH v5 3/5] signal: Add unsafe_copy_siginfo_to_user()

2021-09-17 Thread Christophe Leroy
In the same spirit as commit fb05121fd6a2 ("signal: Add
unsafe_get_compat_sigset()"), implement an 'unsafe' version of
copy_siginfo_to_user() in order to use it within user access blocks.

For that, also add an 'unsafe' version of clear_user().

This commit adds the generic fallback for unsafe_clear_user().
Architectures wanting to use unsafe_copy_siginfo_to_user() within a
user_access_begin() section have to make sure they have their
own unsafe_clear_user().
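
As an illustration, a minimal (hypothetical) caller on an architecture that provides
its own unsafe_clear_user() could look like the sketch below; the copy of the
kernel_siginfo and the clearing of the si_expansion area both happen inside the same
user access window:

static int put_siginfo_example(siginfo_t __user *uinfo, const kernel_siginfo_t *info)
{
        if (!user_access_begin(uinfo, sizeof(*uinfo)))
                return -EFAULT;

        /* Copy the fixed part and zero the expansion area without leaving
         * the user access window. */
        unsafe_copy_siginfo_to_user(uinfo, info, Efault);

        user_access_end();
        return 0;

Efault:
        user_access_end();
        return -EFAULT;
}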

Signed-off-by: Christophe Leroy 
---
v3: Added a clarification about unsafe_clear_user() to the commit message.
---
 include/linux/signal.h  | 15 +++
 include/linux/uaccess.h |  1 +
 kernel/signal.c |  5 -
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 3f96a6374e4f..70ea7e741427 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t *to,
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
 int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from);
 
+static __always_inline char __user *si_expansion(const siginfo_t __user *info)
+{
+   return ((char __user *)info) + sizeof(struct kernel_siginfo);
+}
+
+#define unsafe_copy_siginfo_to_user(to, from, label) do {  \
+   siginfo_t __user *__ucs_to = to;\
+   const kernel_siginfo_t *__ucs_from = from;  \
+   char __user *__ucs_expansion = si_expansion(__ucs_to);  \
+   \
+   unsafe_copy_to_user(__ucs_to, __ucs_from,   \
+   sizeof(struct kernel_siginfo), label);  \
+   unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);   \
+} while (0)
+
 enum siginfo_layout {
SIL_KILL,
SIL_TIMER,
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index c05e903cef02..37073caac474 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
 #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
 #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
 #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
+#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
 static inline unsigned long user_access_save(void) { return 0UL; }
 static inline void user_access_restore(unsigned long flags) { }
 #endif
diff --git a/kernel/signal.c b/kernel/signal.c
index 952741f6d0f9..23f168730b7e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3324,11 +3324,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int si_code)
return layout;
 }
 
-static inline char __user *si_expansion(const siginfo_t __user *info)
-{
-   return ((char __user *)info) + sizeof(struct kernel_siginfo);
-}
-
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from)
 {
char __user *expansion = si_expansion(to);
-- 
2.31.1



[PATCH v5 5/5] powerpc/signal: Use unsafe_copy_siginfo_to_user()

2021-09-17 Thread Christophe Leroy
Use unsafe_copy_siginfo_to_user() in order to do the copy
within the user access block.

On an mpc 8321 (book3s/32) the improvement is about 5% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy 
---
v5: Added missing __user flag when calling unsafe_copy_siginfo_to_user()

v4: Use another approach for compat: drop the unsafe_copy_siginfo_to_user32(), 
instead directly call copy_siginfo_to_external32() before user_access_begin()

v3: Don't leave compat aside, use the new unsafe_copy_siginfo_to_user32()
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal_32.c | 17 -
 arch/powerpc/kernel/signal_64.c |  5 +
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index ff101e2b3bab..0baf3c10b6c0 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -710,12 +710,6 @@ static long restore_tm_user_regs(struct pt_regs *regs, struct mcontext __user *s
 }
 #endif
 
-#ifdef CONFIG_PPC64
-
-#define copy_siginfo_to_user   copy_siginfo_to_user32
-
-#endif /* CONFIG_PPC64 */
-
 /*
  * Set up a signal frame for a "real-time" signal handler
  * (one which gets siginfo).
@@ -731,6 +725,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
struct pt_regs *regs = tsk->thread.regs;
/* Save the thread's msr before get_tm_stackpointer() changes it */
unsigned long msr = regs->msr;
+   compat_siginfo_t uinfo;
 
/* Set up Signal Frame */
frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
@@ -744,6 +739,9 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
else
prepare_save_user_regs(1);
 
+   if (IS_ENABLED(CONFIG_COMPAT))
+   copy_siginfo_to_external32(&uinfo, &ksig->info);
+
if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + 16 + sizeof(*frame)))
goto badframe;
 
@@ -779,15 +777,16 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
}
unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
+   if (IS_ENABLED(CONFIG_COMPAT))
+   unsafe_copy_to_user(&frame->info, &uinfo, sizeof(frame->info), failed);
+   else
+   unsafe_copy_siginfo_to_user((void __user *)&frame->info, &ksig->info, failed);
 
/* create a stack frame for the caller of the handler */
unsafe_put_user(regs->gpr[1], newsp, failed);
 
user_access_end();
 
-   if (copy_siginfo_to_user(&frame->info, &ksig->info))
-   goto badframe;
-
regs->link = tramp;
 
 #ifdef CONFIG_PPC_FPU_REGS
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d80ff83cacb9..56c0c74aa28c 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
}
 
unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
+   unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block);
/* Allocate a dummy caller frame for the signal handler. */
unsafe_put_user(regs->gpr[1], newsp, badframe_block);
 
user_write_access_end();
 
-   /* Save the siginfo outside of the unsafe block. */
-   if (copy_siginfo_to_user(&frame->info, &ksig->info))
-   goto badframe;
-
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
 
-- 
2.31.1



[PATCH v5 2/5] powerpc/signal: Include the new stack frame inside the user access block

2021-09-17 Thread Christophe Leroy
Include the new stack frame inside the user access block and set it up
using unsafe_put_user().

On an mpc 8321 (book3s/32) the improvement is about 4% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal_32.c | 29 +
 arch/powerpc/kernel/signal_64.c | 14 +++---
 2 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..ff101e2b3bab 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -726,7 +726,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
struct rt_sigframe __user *frame;
struct mcontext __user *mctx;
struct mcontext __user *tm_mctx = NULL;
-   unsigned long newsp = 0;
+   unsigned long __user *newsp;
unsigned long tramp;
struct pt_regs *regs = tsk->thread.regs;
/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -734,6 +734,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 
/* Set up Signal Frame */
frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+   newsp = (unsigned long __user *)((unsigned long)frame - (__SIGNAL_FRAMESIZE + 16));
mctx = &frame->uc.uc_mcontext;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
tm_mctx = &frame->uc_transact.uc_mcontext;
@@ -743,7 +744,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
else
prepare_save_user_regs(1);
 
-   if (!user_access_begin(frame, sizeof(*frame)))
+   if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + 16 + sizeof(*frame)))
goto badframe;
 
/* Put the siginfo & fill in most of the ucontext */
@@ -779,6 +780,9 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
}
unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
 
+   /* create a stack frame for the caller of the handler */
+   unsafe_put_user(regs->gpr[1], newsp, failed);
+
user_access_end();
 
if (copy_siginfo_to_user(&frame->info, &ksig->info))
@@ -790,13 +794,8 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
tsk->thread.fp_state.fpscr = 0; /* turn off all fp exceptions */
 #endif
 
-   /* create a stack frame for the caller of the handler */
-   newsp = ((unsigned long)frame) - (__SIGNAL_FRAMESIZE + 16);
-   if (put_user(regs->gpr[1], (u32 __user *)newsp))
-   goto badframe;
-
/* Fill registers for signal handler */
-   regs->gpr[1] = newsp;
+   regs->gpr[1] = (unsigned long)newsp;
regs->gpr[3] = ksig->sig;
regs->gpr[4] = (unsigned long)&frame->info;
regs->gpr[5] = (unsigned long)&frame->uc;
@@ -826,7 +825,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
struct sigframe __user *frame;
struct mcontext __user *mctx;
struct mcontext __user *tm_mctx = NULL;
-   unsigned long newsp = 0;
+   unsigned long __user *newsp;
unsigned long tramp;
struct pt_regs *regs = tsk->thread.regs;
/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -834,6 +833,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 
/* Set up Signal Frame */
frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+   newsp = (unsigned long __user *)((unsigned long)frame - __SIGNAL_FRAMESIZE);
mctx = &frame->mctx;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
tm_mctx = &frame->mctx_transact;
@@ -843,7 +843,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
else
prepare_save_user_regs(1);
 
-   if (!user_access_begin(frame, sizeof(*frame)))
+   if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + sizeof(*frame)))
goto badframe;
sc = (struct sigcontext __user *) &frame->sctx;
 
@@ -873,6 +873,8 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
unsafe_put_user(PPC_RAW_SC(), &mctx->mc_pad[1], failed);
asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
}
+   /* create a stack frame for the caller of the handler */
+   unsafe_put_user(regs->gpr[1], newsp, failed);
user_access_end();
 
regs->link = tramp;
@@ -881,12 +883,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
tsk->thread.fp_state.fpscr = 0; /* turn off all fp exceptions */
 #endif
 
-   /* create a stack frame for the caller of the handler */
-   newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
-   if (put_user(regs->gpr[1], (u32 __user *)newsp))
-   goto badframe;
-
-   regs->gpr[1] = newsp;
+   regs->gpr[1] = (unsigned long)newsp;
regs->gpr[3] = ksig->sig;
regs->gpr[4] = (unsigned long) sc;
regs_set_return_ip(regs, (unsigned long) ksig->ka.sa.sa_handler);
diff --git a/arch/powerpc/kernel/signal_64.c 

[PATCH v5 4/5] powerpc/uaccess: Add unsafe_clear_user()

2021-09-17 Thread Christophe Leroy
Implement unsafe_clear_user() for powerpc.
It's a copy/paste of unsafe_copy_to_user() with value 0 as source.

It may be improved in a later patch by using the 'dcbz' instruction
to zero full cache lines at once.
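
For reference, the dcbz idea could look roughly like the sketch below. This is purely
illustrative and not part of this patch: it assumes the destination is already
cache-line aligned, that user access is open, and it omits the per-store exception
handling a real unsafe_clear_user() needs.

static inline void zero_user_cachelines(void __user *dst, size_t len)
{
        size_t i;

        /* dcbz zeroes one full L1_CACHE_BYTES-sized cache line per iteration. */
        for (i = 0; i < len; i += L1_CACHE_BYTES)
                asm volatile("dcbz 0, %0" : : "r"((unsigned long)dst + i) : "memory");
}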

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/uaccess.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 22c79ab40006..962b675485ff 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -467,6 +467,26 @@ do {   \
unsafe_put_user(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), e); \
 } while (0)
 
+#define unsafe_clear_user(d, l, e) \
+do {   \
+   u8 __user *_dst = (u8 __user *)(d); \
+   size_t _len = (l);  \
+   int _i; \
+   \
+   for (_i = 0; _i < (_len & ~(sizeof(u64) - 1)); _i += sizeof(u64)) \
+   unsafe_put_user(0, (u64 __user *)(_dst + _i), e);   \
+   if (_len & 4) { \
+   unsafe_put_user(0, (u32 __user *)(_dst + _i), e);   \
+   _i += 4;\
+   }   \
+   if (_len & 2) { \
+   unsafe_put_user(0, (u16 __user *)(_dst + _i), e);   \
+   _i += 2;\
+   }   \
+   if (_len & 1)   \
+   unsafe_put_user(0, (u8 __user *)(_dst + _i), e);\
+} while (0)
+
 #define HAVE_GET_KERNEL_NOFAULT
 
 #define __get_kernel_nofault(dst, src, type, err_label)   \
-- 
2.31.1



[PATCH v5 1/5] powerpc/signal64: Access function descriptor with user access block

2021-09-17 Thread Christophe Leroy
Access the function descriptor of the handler within a
user access block.

Signed-off-by: Christophe Leroy 
---
v3: Flatten the change to avoid nested gotos.
---
 arch/powerpc/kernel/signal_64.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..7b1cd50bc4fb 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -936,8 +936,13 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
func_descr_t __user *funct_desc_ptr =
(func_descr_t __user *) ksig->ka.sa.sa_handler;
 
-   err |= get_user(regs->ctr, &funct_desc_ptr->entry);
-   err |= get_user(regs->gpr[2], &funct_desc_ptr->toc);
+   if (!user_read_access_begin(funct_desc_ptr, sizeof(func_descr_t)))
+   goto badfunc;
+
+   unsafe_get_user(regs->ctr, &funct_desc_ptr->entry, badfunc_block);
+   unsafe_get_user(regs->gpr[2], &funct_desc_ptr->toc, badfunc_block);
+
+   user_read_access_end();
}
 
/* enter the signal handler in native-endian mode */
@@ -962,5 +967,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 badframe:
signal_fault(current, regs, "handle_rt_signal64", frame);
 
+   return 1;
+
+badfunc_block:
+   user_read_access_end();
+badfunc:
+   signal_fault(current, regs, __func__, (void __user *)ksig->ka.sa.sa_handler);
+
return 1;
 }
-- 
2.31.1



Re: [PATCH v2] powerpc/mce: Fix access error in mce handler

2021-09-17 Thread Daniel Axtens
Hi Ganesh,

> We queue an irq work for deferred processing of mce event
> in realmode mce handler, where translation is disabled.
> Queuing of the work may result in accessing memory outside
> RMO region, such access needs the translation to be enabled
> for an LPAR running with hash mmu else the kernel crashes.
>
> After enabling translation in mce_handle_error() we used to
> leave it enabled to avoid crashing here, but now with the
> commit 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before
> returning from handler") we are restoring the MSR to disable
> translation.
>
> Hence to fix this enable the translation before queuing the work.

[snip]

> Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from 
> handler")

That patch changes arch/powerpc/powerpc/platforms/pseries/ras.c just
below this comment:

/*
 * Enable translation as we will be accessing per-cpu variables
 * in save_mce_event() which may fall outside RMO region, also
 * leave it enabled because subsequently we will be queuing work
 * to workqueues where again per-cpu variables accessed, besides
 * fwnmi_release_errinfo() crashes when called in realmode on
 * pseries.
 * Note: All the realmode handling like flushing SLB entries for
 *   SLB multihit is done by now.
 */

That suggests per-cpu variables need protection. In your patch, you
enable translations just around irq_work_queue:

> + /* Queue irq work to process this event later. Before
> +  * queuing the work enable translation for non radix LPAR,
> +  * as irq_work_queue may try to access memory outside RMO
> +  * region.
> +  */
> + if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
> + msr = mfmsr();
> + mtmsr(msr | MSR_IR | MSR_DR);
> + irq_work_queue(&mce_event_process_work);
> + mtmsr(msr);
> + } else {
> + irq_work_queue(&mce_event_process_work);
> + }

However, just before that in the function, there are a few things that
access per-cpu variables via the local_paca, e.g.:

memcpy(&local_paca->mce_info->mce_event_queue[index],
   &evt, sizeof(evt));

Do we need to widen the window where translations are enabled in order
to protect accesses to local_paca?

Kind regards,
Daniel