Re: [PATCH v2] iommu/vt-d: avoid unnecessary panic if iommu init fails in tboot system

2020-11-17 Thread Lu Baolu

+Will

Please consider this patch for v5.10.

Best regards,
baolu

On 2020/11/10 15:19, Zhenzhong Duan wrote:

"intel_iommu=off" command line is used to disable iommu but iommu is force
enabled in a tboot system for security reasons.

However, for better performance on high-speed network devices, a new option,
"intel_iommu=tboot_noforce", was introduced to disable the forced enabling.

By default the kernel should panic if IOMMU initialization fails in tboot for
security reasons, but the panic is unnecessary if we use "intel_iommu=tboot_noforce,off".

Fix the code that sets force_on and move the intel_iommu_tboot_noforce check
from the tboot code into the Intel IOMMU code.

Fixes: 7304e8f28bb2 ("iommu/vt-d: Correctly disable Intel IOMMU force on")
Signed-off-by: Zhenzhong Duan 
---
v2: move check of intel_iommu_tboot_noforce into iommu code per Baolu.

  arch/x86/kernel/tboot.c | 3 ---
  drivers/iommu/intel/iommu.c | 5 +++--
  include/linux/intel-iommu.h | 1 -
  3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb14..420be87 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -514,9 +514,6 @@ int tboot_force_iommu(void)
if (!tboot_enabled())
return 0;
  
-	if (intel_iommu_tboot_noforce)
-		return 1;
-
if (no_iommu || swiotlb || dmar_disabled)
pr_warn("Forcing Intel-IOMMU to enabled\n");
  
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c

index 1b1ca63..4d9b298 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -179,7 +179,7 @@ static inline unsigned long virt_to_dma_pfn(void *p)
   * (used when kernel is launched w/ TXT)
   */
  static int force_on = 0;
-int intel_iommu_tboot_noforce;
+static int intel_iommu_tboot_noforce;
  static int no_platform_optin;
  
  #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))

@@ -4885,7 +4885,8 @@ int __init intel_iommu_init(void)
 * Intel IOMMU is required for a TXT/tboot launch or platform
 * opt in, so enforce that.
 */
-   force_on = tboot_force_iommu() || platform_optin_force_iommu();
+   force_on = (!intel_iommu_tboot_noforce && tboot_force_iommu()) ||
+   platform_optin_force_iommu();
  
  	if (iommu_init_mempool()) {

if (force_on)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index fbf5b3e..d956987 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -798,7 +798,6 @@ struct context_entry *iommu_context_addr(struct intel_iommu 
*iommu, u8 bus,
  extern int iommu_calculate_max_sagaw(struct intel_iommu *iommu);
  extern int dmar_disabled;
  extern int intel_iommu_enabled;
-extern int intel_iommu_tboot_noforce;
  extern int intel_iommu_gfx_mapped;
  #else
  static inline int iommu_calculate_agaw(struct intel_iommu *iommu)



Re: [PATCH v7 1/8] block: ensure bios are not split in middle of crypto data unit

2020-11-17 Thread Eric Biggers
On Tue, Nov 17, 2020 at 02:07:01PM +, Satya Tangirala wrote:
> Introduce blk_crypto_bio_sectors_alignment() that returns the required
> alignment for the number of sectors in a bio. Any bio split must ensure
> that the number of sectors in the resulting bios is aligned to that
> returned value. This patch also updates __blk_queue_split(),
> __blk_queue_bounce() and blk_crypto_split_bio_if_needed() to respect
> blk_crypto_bio_sectors_alignment() when splitting bios.
> 
> Signed-off-by: Satya Tangirala 
> ---
>  block/bio.c |  1 +
>  block/blk-crypto-fallback.c | 10 ++--
>  block/blk-crypto-internal.h | 18 +++
>  block/blk-merge.c   | 96 -
>  block/blk-mq.c  |  3 ++
>  block/bounce.c  |  4 ++
>  6 files changed, 117 insertions(+), 15 deletions(-)
> 

I feel like this should be split into multiple patches: one patch that
introduces blk_crypto_bio_sectors_alignment(), and a patch for each place that
needs to take blk_crypto_bio_sectors_alignment() into account.

It would also help to give a real-world example of why support for
data_unit_size > logical_block_size is needed.  E.g. ext4 or f2fs encryption
with a 4096-byte filesystem block size, using eMMC inline encryption hardware
that has logical_block_size=512.

Also, is this needed even without the fscrypt direct I/O support?  If so, it
should be sent out separately.

> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index bcf5e4580603..f34dda7132f9 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -149,13 +149,15 @@ static inline unsigned get_max_io_size(struct 
> request_queue *q,
>   unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT;
>   unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT;
>   unsigned start_offset = bio->bi_iter.bi_sector & (pbs - 1);
> + unsigned int bio_sectors_alignment =
> + blk_crypto_bio_sectors_alignment(bio);
>  
>   max_sectors += start_offset;
>   max_sectors &= ~(pbs - 1);
> - if (max_sectors > start_offset)
> - return max_sectors - start_offset;
> + if (max_sectors - start_offset >= bio_sectors_alignment)
> + return round_down(max_sectors - start_offset, 
> bio_sectors_alignment);
>  
> - return sectors & ~(lbs - 1);
> + return round_down(sectors & ~(lbs - 1), bio_sectors_alignment);
>  }

'max_sectors - start_offset >= bio_sectors_alignment' looks wrong, as
'max_sectors - start_offset' underflows if 'max_sectors < start_offset' (both
operands are unsigned, so the difference wraps to a huge value and the check passes).

Maybe consider something like the below?

static inline unsigned get_max_io_size(struct request_queue *q,
   struct bio *bio)
{
unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT;
unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT;
sector_t pb_aligned_sector =
round_down(bio->bi_iter.bi_sector + sectors, pbs);

lbs = max(lbs, blk_crypto_bio_sectors_alignment(bio));

if (pb_aligned_sector >= bio->bi_iter.bi_sector + lbs)
sectors = pb_aligned_sector - bio->bi_iter.bi_sector;

return round_down(sectors, lbs);
}

Maybe it would be useful to have a helper function bio_required_alignment() that
returns the crypto data unit size if the bio has an encryption context, and the
logical block size if it doesn't?
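
A minimal sketch of such a helper (the name, the request_queue parameter, and
the reuse of blk_crypto_bio_sectors_alignment() are assumptions for
illustration, not something taken from the posted series):

static inline unsigned int bio_required_alignment(struct request_queue *q,
						  struct bio *bio)
{
	/* crypto data unit size (in sectors) if the bio has an encryption context */
	if (bio_has_crypt_ctx(bio))
		return blk_crypto_bio_sectors_alignment(bio);

	/* otherwise the logical block size (in sectors) */
	return queue_logical_block_size(q) >> SECTOR_SHIFT;
}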

>  
>  static inline unsigned get_max_segment_size(const struct request_queue *q,
> @@ -174,6 +176,41 @@ static inline unsigned get_max_segment_size(const struct 
> request_queue *q,
>   (unsigned long)queue_max_segment_size(q));
>  }
>  
> +/**
> + * update_aligned_sectors_and_segs() - Ensures that *@aligned_sectors is 
> aligned
> + *  to @bio_sectors_alignment, and that
> + *  *@aligned_segs is the value of nsegs
> + *  when sectors reached/first exceeded that
> + *  value of *@aligned_sectors.
> + *
> + * @nsegs: [in] The current number of segs
> + * @sectors: [in] The current number of sectors
> + * @aligned_segs: [in,out] The number of segments that make up 
> @aligned_sectors
> + * @aligned_sectors: [in,out] The largest number of sectors <= @sectors that 
> is
> + *aligned to @sectors
> + * @bio_sectors_alignment: [in] The alignment requirement for the number of
> + * sectors
> + *
> + * Updates *@aligned_sectors to the largest number <= @sectors that is also a
> + * multiple of @bio_sectors_alignment. This is done by updating 
> *@aligned_sectors
> + * whenever @sectors is at least @bio_sectors_alignment more than
> + * *@aligned_sectors, since that means we can increment *@aligned_sectors 
> while
> + * still keeping it aligned to @bio_sectors_alignment and also keeping it <=
> + * 

Re: [RESEND][PATCH] ima: Set and clear FMODE_CAN_READ in ima_calc_file_hash()

2020-11-17 Thread Linus Torvalds
On Tue, Nov 17, 2020 at 3:24 PM Mimi Zohar  wrote:
>
> I really wish it wasn't needed.

Seriously, I get the feeling that IMA is completely mis-designed, and
is doing actively bad things.

Who uses this "feature", and who cares? Because I would suggest you
just change the policy and be done with it.

Linus


[PATCH bpf-next v4 2/2] bpf: Add tests for bpf_bprm_opts_set helper

2020-11-17 Thread KP Singh
From: KP Singh 

The test forks a child process, updates the local storage to set/unset
the secureexec bit.

The BPF program in the test attaches to bprm_creds_for_exec which checks
the local storage of the current task to set the secureexec bit on the
binary parameters (bprm).

The child then execs a bash command with the environment variable
TMPDIR set in the envp.  The bash command returns a different exit code
based on its observed value of the TMPDIR variable.

Since TMPDIR is one of the variables that is ignored by the dynamic
loader when the secureexec bit is set, one should expect the
child execution to not see this value when the secureexec bit is set.

Acked-by: Martin KaFai Lau 
Signed-off-by: KP Singh 
---
 .../selftests/bpf/prog_tests/test_bprm_opts.c | 116 ++
 tools/testing/selftests/bpf/progs/bprm_opts.c |  34 +
 2 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bprm_opts.c
 create mode 100644 tools/testing/selftests/bpf/progs/bprm_opts.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_bprm_opts.c 
b/tools/testing/selftests/bpf/prog_tests/test_bprm_opts.c
new file mode 100644
index ..2559bb775762
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_bprm_opts.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (C) 2020 Google LLC.
+ */
+
+#include 
+#include 
+
+#include "bprm_opts.skel.h"
+#include "network_helpers.h"
+
+#ifndef __NR_pidfd_open
+#define __NR_pidfd_open 434
+#endif
+
+static const char * const bash_envp[] = { "TMPDIR=shouldnotbeset", NULL };
+
+static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+   return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int update_storage(int map_fd, int secureexec)
+{
+   int task_fd, ret = 0;
+
+   task_fd = sys_pidfd_open(getpid(), 0);
+   if (task_fd < 0)
+   return errno;
+
+   ret = bpf_map_update_elem(map_fd, &task_fd, &secureexec, BPF_NOEXIST);
+   if (ret)
+   ret = errno;
+
+   close(task_fd);
+   return ret;
+}
+
+static int run_set_secureexec(int map_fd, int secureexec)
+{
+   int child_pid, child_status, ret, null_fd;
+
+   child_pid = fork();
+   if (child_pid == 0) {
+   null_fd = open("/dev/null", O_WRONLY);
+   if (null_fd == -1)
+   exit(errno);
+   dup2(null_fd, STDOUT_FILENO);
+   dup2(null_fd, STDERR_FILENO);
+   close(null_fd);
+
+   /* Ensure that all executions from hereon are
+* secure by setting a local storage which is read by
+* the bprm_creds_for_exec hook and sets bprm->secureexec.
+*/
+   ret = update_storage(map_fd, secureexec);
+   if (ret)
+   exit(ret);
+
+   /* If the binary is executed with secureexec=1, the dynamic
+* loader ignores and unsets certain variables like LD_PRELOAD,
+* TMPDIR etc. TMPDIR is used here to simplify the example, as
+* LD_PRELOAD requires a real .so file.
+*
+* If the value of TMPDIR is set, the bash command returns 10
+* and if the value is unset, it returns 20.
+*/
+   execle("/bin/bash", "bash", "-c",
+  "[[ -z \"${TMPDIR}\" ]] || exit 10 && exit 20", NULL,
+  bash_envp);
+   exit(errno);
+   } else if (child_pid > 0) {
+   waitpid(child_pid, &child_status, 0);
+   ret = WEXITSTATUS(child_status);
+
+   /* If a secureexec occurred, the exit status should be 20 */
+   if (secureexec && ret == 20)
+   return 0;
+
+   /* If normal execution happened, the exit code should be 10 */
+   if (!secureexec && ret == 10)
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+void test_test_bprm_opts(void)
+{
+   int err, duration = 0;
+   struct bprm_opts *skel = NULL;
+
+   skel = bprm_opts__open_and_load();
+   if (CHECK(!skel, "skel_load", "skeleton failed\n"))
+   goto close_prog;
+
+   err = bprm_opts__attach(skel);
+   if (CHECK(err, "attach", "attach failed: %d\n", err))
+   goto close_prog;
+
+   /* Run the test with the secureexec bit unset */
+   err = run_set_secureexec(bpf_map__fd(skel->maps.secure_exec_task_map),
+0 /* secureexec */);
+   if (CHECK(err, "run_set_secureexec:0", "err = %d\n", err))
+   goto close_prog;
+
+   /* Run the test with the secureexec bit set */
+   err = run_set_secureexec(bpf_map__fd(skel->maps.secure_exec_task_map),
+1 /* secureexec */);
+   if (CHECK(err, "run_set_secureexec:1", "err = %d\n", err))
+   goto close_prog;
+

Re: [PATCH 11/24] x86/resctrl: Group staged configuration into a separate struct

2020-11-17 Thread Reinette Chatre

Hi James,

On 10/30/2020 9:11 AM, James Morse wrote:

Arm's MPAM may have surprisingly large bitmaps for its cache
portions as the architecture allows up to 4K portions. The size
exposed via resctrl may not be the same, some scaling may
occur.

The values written to hardware may be unlike the values received
from resctrl, e.g. MBA percentages may be backed by a bitmap,
or a maximum value that isn't a percentage.

Today resctrl's ctrlval arrays are written to directly by the


If using a cryptic word like "ctrlval" it would be easier to understand 
what is meant if it matches the variable in the code, "ctrl_val".



resctrl filesystem code. e.g. apply_config(). This is a problem


This sentence starts with "Today", implying it describes what the code does
before this change, but the example function, apply_config(), is introduced in this patch.



if scaling or conversion is needed by the architecture.

The arch code should own the ctrlval array (to allow scaling and
conversion), and should only need a single copy of the array for the
values currently applied in hardware.


ok, but that is the case, no?



Move the new_ctrl bitmap value and flag into a struct for staged
configuration changes. This is created as an array to allow one per type


This is a bit cryptic as the reader may not know while reading this 
commit message what "new_ctrl" is or where it is currently hosted.



of configuration. Today there is only one element in the array, but
eventually resctrl will use the array slots for CODE/DATA/BOTH to detect
a duplicate schema being written.

Signed-off-by: James Morse 
---
  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 49 ---
  arch/x86/kernel/cpu/resctrl/rdtgroup.c| 22 +-
  include/linux/resctrl.h   | 17 +---
  3 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c 
b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 28d69c78c29e..0c95ed83eb05 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c


...


@@ -240,15 +244,30 @@ static int parse_line(char *line, struct resctrl_schema 
*s,
return -EINVAL;
  }
  
+static void apply_config(struct rdt_hw_domain *hw_dom,

+struct resctrl_staged_config *cfg, int closid,
+cpumask_var_t cpu_mask, bool mba_sc)
+{
+   struct rdt_domain *dom = &hw_dom->resctrl;
+   u32 *dc = mba_sc ? hw_dom->mbps_val : hw_dom->ctrl_val;
+
+   if (cfg->new_ctrl != dc[closid]) {
+   cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+   dc[closid] = cfg->new_ctrl;
+   }
+
+   cfg->have_new_ctrl = false;


Why is this necessary?


+}
+
  int update_domains(struct rdt_resource *r, int closid)
  {
+   struct resctrl_staged_config *cfg;
struct rdt_hw_domain *hw_dom;
struct msr_param msr_param;
cpumask_var_t cpu_mask;
struct rdt_domain *d;
bool mba_sc;
-   u32 *dc;
-   int cpu;
+   int cpu, i;
  
  	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))

return -ENOMEM;
@@ -260,10 +279,12 @@ int update_domains(struct rdt_resource *r, int closid)
mba_sc = is_mba_sc(r);
list_for_each_entry(d, &r->domains, list) {
hw_dom = resctrl_to_arch_dom(d);
-   dc = !mba_sc ? hw_dom->ctrl_val : hw_dom->mbps_val;
-   if (d->have_new_ctrl && d->new_ctrl != dc[closid]) {
-   cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
-   dc[closid] = d->new_ctrl;
+   for (i = 0; i < ARRAY_SIZE(d->staged_config); i++) {


I understand it may make later patches easier but it seems too early to 
introduce this loop because apply_config() does not seem to be ready for 
it yet (it would just keep overwriting a closid's config).



+   cfg = &hw_dom->resctrl.staged_config[i];
+   if (!cfg->have_new_ctrl)
+   continue;
+
+   apply_config(hw_dom, cfg, closid, cpu_mask, mba_sc);
}
}
  
@@ -338,7 +359,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
  
  	list_for_each_entry(s, &resctrl_all_schema, list) {

list_for_each_entry(dom, &s->res->domains, list)
-   dom->have_new_ctrl = false;
+   memset(dom->staged_config, 0, 
sizeof(dom->staged_config));
}
  
  	while ((tok = strsep(&buf, "\n")) != NULL) {


...


diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9f71f0238239..f1164bbb66c5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -26,13 +26,21 @@ enum resctrl_conf_type {
CDP_DATA,
  };
  
+/**

+ * struct resctrl_staged_config - parsed configuration to be applied
+ * @new_ctrl:  new ctrl value to be loaded
+ * @have_new_ctrl: did user provide new_ctrl for this domain


The "for this domain" in 

[PATCH bpf-next v4 1/2] bpf: Add bpf_bprm_opts_set helper

2020-11-17 Thread KP Singh
From: KP Singh 

The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_F_BPRM_SECUREEXEC flag.

secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc.  When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).
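
Purely as an illustration (not part of this patch), a minimal BPF LSM program
using the new helper might look like the sketch below; setting the flag
unconditionally is an assumption here, while the selftest in the next patch
drives it from task local storage:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bprm_creds_for_exec")
int BPF_PROG(set_secureexec, struct linux_binprm *bprm)
{
	/* Request AT_SECURE for every exec seen by this hook. */
	bpf_bprm_opts_set(bprm, BPF_F_BPRM_SECUREEXEC);
	return 0;
}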

Acked-by: Martin KaFai Lau 
Signed-off-by: KP Singh 
---
 include/uapi/linux/bpf.h   | 16 
 kernel/bpf/bpf_lsm.c   | 26 ++
 scripts/bpf_helpers_doc.py |  2 ++
 tools/include/uapi/linux/bpf.h | 16 
 4 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 162999b12790..a52299b80b9d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3787,6 +3787,16 @@ union bpf_attr {
  * *ARG_PTR_TO_BTF_ID* of type *task_struct*.
  * Return
  * Pointer to the current task.
+ *
+ * long bpf_bprm_opts_set(struct linux_binprm *bprm, u64 flags)
+ * Description
+ * Set or clear certain options on *bprm*:
+ *
+ * **BPF_F_BPRM_SECUREEXEC** Set the secureexec bit
+ * which sets the **AT_SECURE** auxv for glibc. The bit
+ * is cleared if the flag is not specified.
+ * Return
+ * **-EINVAL** if invalid *flags* are passed, zero otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -3948,6 +3958,7 @@ union bpf_attr {
FN(task_storage_get),   \
FN(task_storage_delete),\
FN(get_current_task_btf),   \
+   FN(bprm_opts_set),  \
/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4119,6 +4130,11 @@ enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_IP,
 };
 
+/* Flags for bpf_bprm_opts_set helper */
+enum {
+   BPF_F_BPRM_SECUREEXEC   = (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)   \
 union {\
type name;  \
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 553107f4706a..b4f27a874092 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,29 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
return 0;
 }
 
+/* Mask for all the currently supported BPRM option flags */
+#define BPF_F_BRPM_OPTS_MASK   BPF_F_BPRM_SECUREEXEC
+
+BPF_CALL_2(bpf_bprm_opts_set, struct linux_binprm *, bprm, u64, flags)
+{
+   if (flags & ~BPF_F_BRPM_OPTS_MASK)
+   return -EINVAL;
+
+   bprm->secureexec = (flags & BPF_F_BPRM_SECUREEXEC);
+   return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_bprm_opts_set_btf_ids, struct, linux_binprm)
+
+const static struct bpf_func_proto bpf_bprm_opts_set_proto = {
+   .func   = bpf_bprm_opts_set,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_BTF_ID,
+   .arg1_btf_id= &bpf_bprm_opts_set_btf_ids[0],
+   .arg2_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -71,6 +95,8 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
return &bpf_task_storage_get_proto;
case BPF_FUNC_task_storage_delete:
return &bpf_task_storage_delete_proto;
+   case BPF_FUNC_bprm_opts_set:
+   return &bpf_bprm_opts_set_proto;
default:
return tracing_prog_func_proto(func_id, prog);
}
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 31484377b8b1..c5bc947a70ad 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -418,6 +418,7 @@ class PrinterHelpers(Printer):
 'struct bpf_tcp_sock',
 'struct bpf_tunnel_key',
 'struct bpf_xfrm_state',
+'struct linux_binprm',
 'struct pt_regs',
 'struct sk_reuseport_md',
 'struct sockaddr',
@@ -465,6 +466,7 @@ class PrinterHelpers(Printer):
 'struct bpf_tcp_sock',
 'struct bpf_tunnel_key',
 'struct bpf_xfrm_state',
+'struct linux_binprm',
 'struct pt_regs',
 'struct sk_reuseport_md',
 'struct sockaddr',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 162999b12790..a52299b80b9d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3787,6 +3787,16 @@ union bpf_attr {
  * *ARG_PTR_TO_BTF_ID* of type *task_struct*.
  * Return
  * Pointer to the current task.
+ *
+ * long bpf_bprm_opts_set(struct linux_binprm *bprm, u64 flags)
+ * Description
+ *   

RE: [PATCH v2 5/8] usb: typec: Use Thunderbolt 3 cable discover mode VDO in Enter_USB message

2020-11-17 Thread Patel, Utkarsh H
Hi Heikki,

> -Original Message-
> From: Heikki Krogerus 
> Sent: Tuesday, November 17, 2020 4:10 AM
> To: Patel, Utkarsh H 
> Cc: linux-kernel@vger.kernel.org; linux-...@vger.kernel.org;
> pmal...@chromium.org; enric.balle...@collabora.com; Mani, Rajmohan
> ; Shaikh, Azhar 
> Subject: Re: [PATCH v2 5/8] usb: typec: Use Thunderbolt 3 cable discover
> mode VDO in Enter_USB message
> 
> On Fri, Nov 13, 2020 at 12:25:00PM -0800, Utkarsh Patel wrote:
> > USB4 also uses same cable properties as Thunderbolt 3 so use
> > Thunderbolt 3 cable discover mode VDO to fill details such as active
> > cable plug link training and cable rounded support.
> 
> I'm sorry, but I think that has to be explained better. We only need the
> Thunderbolt 3 properties when we create the USB4 connection with
> Thunderbolt 3 cables. With USB4 cables that information is simply not
> available. Claiming that USB4 uses the same properties in general is not true.

Ack. I will change the commit message.  

> 
> > Suggested-by: Heikki Krogerus 
> > Signed-off-by: Utkarsh Patel 
> > --
> > Changes in v2:
> > - No change.
> > --
> > ---
> >  include/linux/usb/typec.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/include/linux/usb/typec.h b/include/linux/usb/typec.h
> > index 6be558045942..d91e09d9d91c 100644
> > --- a/include/linux/usb/typec.h
> > +++ b/include/linux/usb/typec.h
> > @@ -75,6 +75,7 @@ enum typec_orientation {
> >  /*
> >   * struct enter_usb_data - Enter_USB Message details
> >   * @eudo: Enter_USB Data Object
> > + * @tbt_cable_vdo: TBT3 Cable Discover Mode Response
> >   * @active_link_training: Active Cable Plug Link Training
> >   *
> >   * @active_link_training is a flag that should be set with
> > uni-directional SBRX
> 
> Please also explain the same here with a short comment. So basically, if the
> USB4 connection is created using TBT3 cable, then we need to supply also the
> TBT3 Cable VDO as part of this data. But if USB4 cable is used, then that
> member should not be filled at all.

Ack. 

> 
> > @@ -83,6 +84,7 @@ enum typec_orientation {
> >   */
> >  struct enter_usb_data {
> > u32 eudo;
> > +   u32 tbt_cable_vdo;
> > unsigned char   active_link_training:1;
> >  };
> 
> thanks,
> 
> --
> Heikki

Sincerely,
Utkarsh Patel. 


[PATCH v7 13/13] misc: bcm-vk: add ttyVK support

2020-11-17 Thread Scott Branden
Add ttyVK support to the driver to allow console access to the VK card from
the host.

Device nodes will be in the following form, /dev/bcm-vk.x_ttyVKy, where:
x is the instance of the VK card
y is the tty device number on the VK card
(e.g. /dev/bcm-vk.0_ttyVK1 is tty device 1 on the first VK card)

Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/Makefile |   3 +-
 drivers/misc/bcm-vk/bcm_vk.h |  28 +++
 drivers/misc/bcm-vk/bcm_vk_dev.c |  30 ++-
 drivers/misc/bcm-vk/bcm_vk_tty.c | 333 +++
 4 files changed, 392 insertions(+), 2 deletions(-)
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_tty.c

diff --git a/drivers/misc/bcm-vk/Makefile b/drivers/misc/bcm-vk/Makefile
index 79b4e365c9e6..e4a1486f7209 100644
--- a/drivers/misc/bcm-vk/Makefile
+++ b/drivers/misc/bcm-vk/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_BCM_VK) += bcm_vk.o
 bcm_vk-objs := \
bcm_vk_dev.o \
bcm_vk_msg.o \
-   bcm_vk_sg.o
+   bcm_vk_sg.o \
+   bcm_vk_tty.o
 
diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index a1d0bf6e694c..3f37c640a814 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -8,12 +8,14 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -84,6 +86,9 @@
 #define CODEPUSH_BOOT2_ENTRY   0x6000
 
 #define BAR_CARD_STATUS0x410
+/* CARD_STATUS definitions */
+#define CARD_STATUS_TTYVK0_READY   BIT(0)
+#define CARD_STATUS_TTYVK1_READY   BIT(1)
 
 #define BAR_BOOT1_STDALONE_PROGRESS0x420
 #define BOOT1_STDALONE_SUCCESS (BIT(13) | BIT(14))
@@ -255,6 +260,19 @@ enum pci_barno {
 
 #define BCM_VK_NUM_TTY 2
 
+struct bcm_vk_tty {
+   struct tty_port port;
+   u32 to_offset;  /* bar offset to use */
+   u32 to_size;/* to VK buffer size */
+   u32 wr; /* write offset shadow */
+   u32 from_offset;/* bar offset to use */
+   u32 from_size;  /* from VK buffer size */
+   u32 rd; /* read offset shadow */
+   pid_t pid;
+   bool irq_enabled;
+   bool is_opened; /* tracks tty open/close */
+};
+
 /* VK device max power state, supports 3, full, reduced and low */
 #define MAX_OPP 3
 #define MAX_CARD_INFO_TAG_SIZE 64
@@ -348,6 +366,12 @@ struct bcm_vk {
struct miscdevice miscdev;
int devid; /* dev id allocated */
 
+   struct tty_driver *tty_drv;
+   struct timer_list serial_timer;
+   struct bcm_vk_tty tty[BCM_VK_NUM_TTY];
+   struct workqueue_struct *tty_wq_thread;
+   struct work_struct tty_wq_work;
+
/* Reference-counting to handle file operations */
struct kref kref;
 
@@ -466,6 +490,7 @@ int bcm_vk_release(struct inode *inode, struct file 
*p_file);
 void bcm_vk_release_data(struct kref *kref);
 irqreturn_t bcm_vk_msgq_irqhandler(int irq, void *dev_id);
 irqreturn_t bcm_vk_notf_irqhandler(int irq, void *dev_id);
+irqreturn_t bcm_vk_tty_irqhandler(int irq, void *dev_id);
 int bcm_vk_msg_init(struct bcm_vk *vk);
 void bcm_vk_msg_remove(struct bcm_vk *vk);
 void bcm_vk_drain_msg_on_reset(struct bcm_vk *vk);
@@ -476,6 +501,9 @@ int bcm_vk_send_shutdown_msg(struct bcm_vk *vk, u32 
shut_type,
 const pid_t pid, const u32 q_num);
 void bcm_to_v_q_doorbell(struct bcm_vk *vk, u32 q_num, u32 db_val);
 int bcm_vk_auto_load_all_images(struct bcm_vk *vk);
+int bcm_vk_tty_init(struct bcm_vk *vk, char *name);
+void bcm_vk_tty_exit(struct bcm_vk *vk);
+void bcm_vk_tty_terminate_tty_user(struct bcm_vk *vk);
 void bcm_vk_hb_init(struct bcm_vk *vk);
 void bcm_vk_hb_deinit(struct bcm_vk *vk);
 void bcm_vk_handle_notf(struct bcm_vk *vk);
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index cac07419f041..c3d2bba68ef1 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -525,6 +525,7 @@ void bcm_vk_blk_drv_access(struct bcm_vk *vk)
}
}
}
+   bcm_vk_tty_terminate_tty_user(vk);
spin_unlock(&vk->ctx_lock);
 }
 
@@ -1384,6 +1385,20 @@ static int bcm_vk_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
}
vk->num_irqs++;
 
+   for (i = 0;
+(i < VK_MSIX_TTY_MAX) && (vk->num_irqs < irq);
+i++, vk->num_irqs++) {
+   err = devm_request_irq(dev, pci_irq_vector(pdev, vk->num_irqs),
+  bcm_vk_tty_irqhandler,
+  IRQF_SHARED, DRV_MODULE_NAME, vk);
+   if (err) {
+   dev_err(dev, "failed request tty IRQ %d for MSIX %d\n",
+   pdev->irq + vk->num_irqs, vk->num_irqs + 1);
+   goto err_irq;
+   }
+   vk->tty[i].irq_enabled = true;
+   }
+
id = ida_simple_get(&bcm_vk_ida, 0, 0, GFP_KERNEL);
if (id < 0) {
err = id;
@@ -1436,6 +1451,11 @@ static int 

[PATCH v7 12/13] MAINTAINERS: bcm-vk: add maintainer for Broadcom VK Driver

2020-11-17 Thread Scott Branden
Add maintainer entry for new Broadcom VK Driver

Signed-off-by: Scott Branden 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e451dcce054f..f574946f2f56 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3716,6 +3716,13 @@ L:   net...@vger.kernel.org
 S: Supported
 F: drivers/net/ethernet/broadcom/tg3.*
 
+BROADCOM VK DRIVER
+M: Scott Branden 
+L: bcm-kernel-feedback-l...@broadcom.com
+S: Supported
+F: drivers/misc/bcm-vk/
+F: include/uapi/linux/misc/bcm_vk.h
+
 BROCADE BFA FC SCSI DRIVER
 M: Anil Gurumurthy 
 M: Sudarsana Kalluru 
-- 
2.17.1



[PATCH v7 10/13] misc: bcm-vk: reset_pid support

2020-11-17 Thread Scott Branden
Add reset support via ioctl.
Kill user processes that have the device open when the VK card is reset.
If a particular PID has issued the reset request, do not kill that process,
as it issued the ioctl.

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h |   1 +
 drivers/misc/bcm-vk/bcm_vk_dev.c | 158 +--
 drivers/misc/bcm-vk/bcm_vk_msg.c |  40 +++-
 3 files changed, 191 insertions(+), 8 deletions(-)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index d847a512d0ed..a1d0bf6e694c 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -468,6 +468,7 @@ irqreturn_t bcm_vk_msgq_irqhandler(int irq, void *dev_id);
 irqreturn_t bcm_vk_notf_irqhandler(int irq, void *dev_id);
 int bcm_vk_msg_init(struct bcm_vk *vk);
 void bcm_vk_msg_remove(struct bcm_vk *vk);
+void bcm_vk_drain_msg_on_reset(struct bcm_vk *vk);
 int bcm_vk_sync_msgq(struct bcm_vk *vk, bool force_sync);
 void bcm_vk_blk_drv_access(struct bcm_vk *vk);
 s32 bcm_to_h_msg_dequeue(struct bcm_vk *vk);
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index 5d82f02c0f27..e572a7b18fab 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -504,7 +504,9 @@ void bcm_vk_blk_drv_access(struct bcm_vk *vk)
int i;
 
/*
-* kill all the apps
+* kill all the apps except for the process that is resetting.
+* If not called during reset, reset_pid will be 0, and all will be
+* killed.
 */
spin_lock(&vk->ctx_lock);
 
@@ -515,10 +517,12 @@ void bcm_vk_blk_drv_access(struct bcm_vk *vk)
struct bcm_vk_ctx *ctx;
 
list_for_each_entry(ctx, &vk->pid_ht[i].head, node) {
-   dev_dbg(&vk->pdev->dev,
-   "Send kill signal to pid %d\n",
-   ctx->pid);
-   kill_pid(find_vpid(ctx->pid), SIGKILL, 1);
+   if (ctx->pid != vk->reset_pid) {
+   dev_dbg(&vk->pdev->dev,
+   "Send kill signal to pid %d\n",
+   ctx->pid);
+   kill_pid(find_vpid(ctx->pid), SIGKILL, 1);
+   }
}
}
spin_unlock(&vk->ctx_lock);
@@ -1001,6 +1005,49 @@ static long bcm_vk_load_image(struct bcm_vk *vk,
return ret;
 }
 
+static int bcm_vk_reset_successful(struct bcm_vk *vk)
+{
+   struct device *dev = &vk->pdev->dev;
+   u32 fw_status, reset_reason;
+   int ret = -EAGAIN;
+
+   /*
+* Reset could be triggered when the card in several state:
+*   i)   in bootROM
+*   ii)  after boot1
+*   iii) boot2 running
+*
+* i) & ii) - no status bits will be updated.  If vkboot1
+* runs automatically after reset, it  will update the reason
+* to be unknown reason
+* iii) - reboot reason match + deinit done.
+*/
+   fw_status = vkread32(vk, BAR_0, VK_BAR_FWSTS);
+   /* immediate exit if interface goes down */
+   if (BCM_VK_INTF_IS_DOWN(fw_status)) {
+   dev_err(dev, "PCIe Intf Down!\n");
+   goto reset_exit;
+   }
+
+   reset_reason = (fw_status & VK_FWSTS_RESET_REASON_MASK);
+   if ((reset_reason == VK_FWSTS_RESET_MBOX_DB) ||
+   (reset_reason == VK_FWSTS_RESET_UNKNOWN))
+   ret = 0;
+
+   /*
+* if some of the deinit bits are set, but done
+* bit is not, this is a failure if triggered while boot2 is running
+*/
+   if ((fw_status & VK_FWSTS_DEINIT_TRIGGERED) &&
+   !(fw_status & VK_FWSTS_RESET_DONE))
+   ret = -EAGAIN;
+
+reset_exit:
+   dev_dbg(dev, "FW status = 0x%x ret %d\n", fw_status, ret);
+
+   return ret;
+}
+
 static void bcm_to_v_reset_doorbell(struct bcm_vk *vk, u32 db_val)
 {
vkwrite32(vk, db_val, BAR_0, VK_BAR0_RESET_DB_BASE);
@@ -1010,12 +1057,16 @@ static int bcm_vk_trigger_reset(struct bcm_vk *vk)
 {
u32 i;
u32 value, boot_status;
+   bool is_stdalone, is_boot2;
static const u32 bar0_reg_clr_list[] = { BAR_OS_UPTIME,
 BAR_INTF_VER,
 BAR_CARD_VOLTAGE,
 BAR_CARD_TEMPERATURE,
 BAR_CARD_PWR_AND_THRE };
 
+   /* clean up before pressing the door bell */
+   bcm_vk_drain_msg_on_reset(vk);
+   vkwrite32(vk, 0, BAR_1, VK_BAR1_MSGQ_DEF_RDY);
/* make tag '\0' terminated */
vkwrite32(vk, 0, BAR_1, VK_BAR1_BOOT1_VER_TAG);
 
@@ -1026,6 +1077,11 @@ static int bcm_vk_trigger_reset(struct bcm_vk *vk)
for (i = 0; i < VK_BAR1_SOTP_REVID_MAX; i++)
vkwrite32(vk, 0, BAR_1, 

[PATCH v7 11/13] misc: bcm-vk: add mmap function for exposing BAR2

2020-11-17 Thread Scott Branden
Add an mmap function that allows a host application to map BAR2 memory
for remotely spooling out messages from the VK logger.
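
A hypothetical host-side usage sketch (the device path, mapping length, and
log-parsing step are assumptions for illustration only):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/bcm-vk.0", O_RDWR);
	void *bar2;

	if (fd < 0)
		return 1;

	/* map the first page of BAR2 read-only to inspect the logger area */
	bar2 = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (bar2 == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* ... parse the peer log control block and spool messages here ... */

	munmap(bar2, 4096);
	close(fd);
	return 0;
}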

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk_dev.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index e572a7b18fab..cac07419f041 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -1199,6 +1199,29 @@ static long bcm_vk_reset(struct bcm_vk *vk, struct 
vk_reset __user *arg)
return ret;
 }
 
+static int bcm_vk_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct bcm_vk_ctx *ctx = file->private_data;
+   struct bcm_vk *vk = container_of(ctx->miscdev, struct bcm_vk, miscdev);
+   unsigned long pg_size;
+
+   /* only BAR2 is mmap possible, which is bar num 4 due to 64bit */
+#define VK_MMAPABLE_BAR 4
+
+   pg_size = ((pci_resource_len(vk->pdev, VK_MMAPABLE_BAR) - 1)
+   >> PAGE_SHIFT) + 1;
+   if (vma->vm_pgoff + vma_pages(vma) > pg_size)
+   return -EINVAL;
+
+   vma->vm_pgoff += (pci_resource_start(vk->pdev, VK_MMAPABLE_BAR)
+ >> PAGE_SHIFT);
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+   return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
 static long bcm_vk_ioctl(struct file *file, unsigned int cmd, unsigned long 
arg)
 {
long ret = -EINVAL;
@@ -1237,6 +1260,7 @@ static const struct file_operations bcm_vk_fops = {
.write = bcm_vk_write,
.poll = bcm_vk_poll,
.release = bcm_vk_release,
+   .mmap = bcm_vk_mmap,
.unlocked_ioctl = bcm_vk_ioctl,
 };
 
-- 
2.17.1



Re: [arm] BUG: KASAN: slab-out-of-bounds in memcmp+0x30/0x5c

2020-11-17 Thread Nishanth Menon
On 16:25-20201117, Arnd Bergmann wrote:
> On Tue, Nov 17, 2020 at 3:44 PM Naresh Kamboju
>  wrote:
> >
> > While booting arm KASAN config enabled kernel on TI x15 device
> > Linux version 5.10.0-rc3-next-20201116.
> >
> > The reported issue is not a regression since we have recently started 
> > testing
> > arm+kasan builds on LKFT.
> >
> > The boot was not successful on x15 and qemu_arm  for some other reason.
> > The kernel config and crash log attached to this email.
> 
> Nice find!
> 
> > [   13.071906] BUG: KASAN: slab-out-of-bounds in memcmp+0x30/0x5c
> > [   13.077526] Synopsys Designware Multimedia Card Interface Driver
> > [   13.077781] Read of size 1 at addr c5ae1d90 by task kworker/0:0/5
> > [   13.089918]
> > [   13.091433] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted
> > 5.10.0-rc3-next-20201116 #2
> > [   13.093605] sdhci-pltfm: SDHCI platform and OF driver helper
> > [   13.099135] Hardware name: Generic DRA74X (Flattened Device Tree)
> > [   13.110942] Workqueue: events dbs_work_handler
> > [   13.115442] [] (unwind_backtrace) from []
> > (show_stack+0x10/0x14)
> > [   13.123240] [] (show_stack) from []
> > (dump_stack+0xc8/0xe0)
> > [   13.130518] [] (dump_stack) from []
> > (print_address_description.constprop.0+0x34/0x2dc)
> > [   13.140238] [] (print_address_description.constprop.0)
> > from [] (kasan_report+0x1a8/0x1c4)
> > [   13.145871] omap_gpio 4805d000.gpio: Could not set line 27 debounce
> > to 20 microseconds (-22)
> > [   13.150221] [] (kasan_report) from [] 
> > (memcmp+0x30/0x5c)
> > [   13.159064] sdhci-omap 4809c000.mmc: Got CD GPIO
> > [   13.166123] [] (memcmp) from []
> > (ti_abb_set_voltage_sel+0x94/0x58c)
> > [   13.166150] [] (ti_abb_set_voltage_sel) from []
> > (_regulator_call_set_voltage_sel+0xd8/0x12c)
> 
> 
> I see this code in ti_abb_set_voltage_sel():
> 
> if (sel >= desc->n_voltages) {
> dev_err(dev, "%s: sel idx(%d) >= n_voltages(%d)\n", __func__,
> sel, desc->n_voltages);
> return -EINVAL;
> }
> 
> /* If we are in the same index as we were, nothing to do here! */
> if (sel == abb->current_info_idx) {
> dev_dbg(dev, "%s: Already at sel=%d\n", __func__, sel);
> return ret;
> }
> 
> /* If data is exactly the same, then just update index, no change */
> info = &abb->info[sel];
> oinfo = &abb->info[abb->current_info_idx];
> if (!memcmp(info, oinfo, sizeof(*info))) {
> 
> One of the two pointers overflows the abb->info array that is allocated
> with length 'desc->n_voltages'. The 'sel' argument is checked against
> that limit, so I assume it's abb->current_info_idx, and this is indeed
> initialized as
> 
> /* We do not know where the OPP voltage is at the moment */
> abb->current_info_idx = -EINVAL;
> 
> Using the negative '-EINVAL' as an array index would indeed cause
> an out-of-bounds access.
> 
> Could you try adding this extra bounds check?
> 
> index 3e60bff76194..c475a9461027 100644
> --- a/drivers/regulator/ti-abb-regulator.c
> +++ b/drivers/regulator/ti-abb-regulator.c
> @@ -345,7 +345,8 @@ static int ti_abb_set_voltage_sel(struct
> regulator_dev *rdev, unsigned sel)
> /* If data is exactly the same, then just update index, no change */
> info = &abb->info[sel];
> oinfo = &abb->info[abb->current_info_idx];
> -   if (!memcmp(info, oinfo, sizeof(*info))) {
> +   if (abb->current_info_idx >= 0 &&
> +   !memcmp(info, oinfo, sizeof(*info))) {
> dev_dbg(dev, "%s: Same data new idx=%d, old idx=%d\n", 
> __func__,
> sel, abb->current_info_idx);
> goto out;
> 
>   Arnd


Yes, this was indeed a bug that has been around for some time now :(

I tested with a variant of the above (didn't like that
oinfo was being assigned an invalid address).
Boot log: https://pastebin.ubuntu.com/p/nZfz3HF8N6/ (with the same
config as in the report). Would you prefer me to send the following
as a formal patch?

diff --git a/drivers/regulator/ti-abb-regulator.c 
b/drivers/regulator/ti-abb-regulator.c
index 3e60bff76194..9f0a4d50cead 100644
--- a/drivers/regulator/ti-abb-regulator.c
+++ b/drivers/regulator/ti-abb-regulator.c
@@ -342,8 +342,17 @@ static int ti_abb_set_voltage_sel(struct regulator_dev 
*rdev, unsigned sel)
return ret;
}
 
-   /* If data is exactly the same, then just update index, no change */
i
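
Purely as an illustration of the approach described (forming the oinfo pointer
only after validating abb->current_info_idx), not the actual posted patch, the
reordering could look like:

	info = &abb->info[sel];
	if (abb->current_info_idx >= 0) {
		oinfo = &abb->info[abb->current_info_idx];
		/* If data is exactly the same, then just update index */
		if (!memcmp(info, oinfo, sizeof(*info))) {
			dev_dbg(dev, "%s: Same data new idx=%d, old idx=%d\n",
				__func__, sel, abb->current_info_idx);
			goto out;
		}
	}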

[PATCH v7 09/13] misc: bcm-vk: add VK messaging support

2020-11-17 Thread Scott Branden
Add message support in order to be able to communicate
with the VK card via message queues.

This info is used for debug purposes via collection of logs via direct
read of BAR space and by sysfs access (in a follow on commit).

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/Makefile |3 +-
 drivers/misc/bcm-vk/bcm_vk.h |  123 
 drivers/misc/bcm-vk/bcm_vk_dev.c |  309 +++-
 drivers/misc/bcm-vk/bcm_vk_msg.c | 1187 ++
 drivers/misc/bcm-vk/bcm_vk_msg.h |  132 
 drivers/misc/bcm-vk/bcm_vk_sg.c  |  275 +++
 drivers/misc/bcm-vk/bcm_vk_sg.h  |   61 ++
 7 files changed, 2087 insertions(+), 3 deletions(-)
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_sg.c
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_sg.h

diff --git a/drivers/misc/bcm-vk/Makefile b/drivers/misc/bcm-vk/Makefile
index a2ae79858409..79b4e365c9e6 100644
--- a/drivers/misc/bcm-vk/Makefile
+++ b/drivers/misc/bcm-vk/Makefile
@@ -6,5 +6,6 @@
 obj-$(CONFIG_BCM_VK) += bcm_vk.o
 bcm_vk-objs := \
bcm_vk_dev.o \
-   bcm_vk_msg.o
+   bcm_vk_msg.o \
+   bcm_vk_sg.o
 
diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index 50f2a0cd6e13..d847a512d0ed 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -6,11 +6,13 @@
 #ifndef BCM_VK_H
 #define BCM_VK_H
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -93,14 +95,53 @@
 #define MAJOR_SOC_REV(_chip_id)(((_chip_id) >> 20) & 0xf)
 
 #define BAR_CARD_TEMPERATURE   0x45c
+/* defines for all temperature sensor */
+#define BCM_VK_TEMP_FIELD_MASK 0xff
+#define BCM_VK_CPU_TEMP_SHIFT  0
+#define BCM_VK_DDR0_TEMP_SHIFT 8
+#define BCM_VK_DDR1_TEMP_SHIFT 16
 
 #define BAR_CARD_VOLTAGE   0x460
+/* defines for voltage rail conversion */
+#define BCM_VK_VOLT_RAIL_MASK  0x
+#define BCM_VK_3P3_VOLT_REG_SHIFT  16
 
 #define BAR_CARD_ERR_LOG   0x464
+/* Error log register bit definition - register for error alerts */
+#define ERR_LOG_UECC   BIT(0)
+#define ERR_LOG_SSIM_BUSY  BIT(1)
+#define ERR_LOG_AFBC_BUSY  BIT(2)
+#define ERR_LOG_HIGH_TEMP_ERR  BIT(3)
+#define ERR_LOG_WDOG_TIMEOUT   BIT(4)
+#define ERR_LOG_SYS_FAULT  BIT(5)
+#define ERR_LOG_RAMDUMP		BIT(6)
+#define ERR_LOG_COP_WDOG_TIMEOUT   BIT(7)
+/* warnings */
+#define ERR_LOG_MEM_ALLOC_FAIL BIT(8)
+#define ERR_LOG_LOW_TEMP_WARN  BIT(9)
+#define ERR_LOG_ECCBIT(10)
+#define ERR_LOG_IPC_DWNBIT(11)
+
+/* Alert bit definitions detected on host */
+#define ERR_LOG_HOST_INTF_V_FAIL   BIT(13)
+#define ERR_LOG_HOST_HB_FAIL   BIT(14)
+#define ERR_LOG_HOST_PCIE_DWN  BIT(15)
 
 #define BAR_CARD_ERR_MEM   0x468
+/* defines for mem err, all fields have same width */
+#define BCM_VK_MEM_ERR_FIELD_MASK  0xff
+#define BCM_VK_ECC_MEM_ERR_SHIFT   0
+#define BCM_VK_UECC_MEM_ERR_SHIFT  8
+/* threshold of event occurrence and logs start to come out */
+#define BCM_VK_ECC_THRESHOLD   10
+#define BCM_VK_UECC_THRESHOLD  1
 
 #define BAR_CARD_PWR_AND_THRE  0x46c
+/* defines for power and temp threshold, all fields have same width */
+#define BCM_VK_PWR_AND_THRE_FIELD_MASK 0xff
+#define BCM_VK_LOW_TEMP_THRE_SHIFT 0
+#define BCM_VK_HIGH_TEMP_THRE_SHIFT8
+#define BCM_VK_PWR_STATE_SHIFT 16
 
 #define BAR_CARD_STATIC_INFO   0x470
 
@@ -143,6 +184,11 @@
 #define BAR_FIRMWARE_TAG_SIZE  50
 #define FIRMWARE_STATUS_PRE_INIT_DONE  0x1f
 
+/* VK MSG_ID defines */
+#define VK_MSG_ID_BITMAP_SIZE  4096
+#define VK_MSG_ID_BITMAP_MASK  (VK_MSG_ID_BITMAP_SIZE - 1)
+#define VK_MSG_ID_OVERFLOW 0x
+
 /*
  * BAR1
  */
@@ -197,6 +243,10 @@
 /* VK device supports a maximum of 3 bars */
#define MAX_BAR	3
 
+/* default number of msg blk for inband SGL */
+#define BCM_VK_DEF_IB_SGL_BLK_LEN   16
+#define BCM_VK_IB_SGL_BLK_MAX   24
+
 enum pci_barno {
BAR_0 = 0,
BAR_1,
@@ -267,9 +317,27 @@ struct bcm_vk_proc_mon_info {
struct bcm_vk_proc_mon_entry_t entries[BCM_VK_PROC_MON_MAX];
 };
 
+struct bcm_vk_hb_ctrl {
+   struct timer_list timer;
+   u32 last_uptime;
+   u32 lost_cnt;
+};
+
+struct bcm_vk_alert {
+   u16 flags;
+   u16 notfs;
+};
+
+/* some alert counters that the driver will keep track */
+struct bcm_vk_alert_cnts {
+   u16 ecc;
+   u16 uecc;
+};
+
 struct bcm_vk {
struct pci_dev *pdev;
void __iomem *bar[MAX_BAR];
+   int num_irqs;
 
struct bcm_vk_card_info card_info;
struct bcm_vk_proc_mon_info proc_mon_info;
@@ -283,9 +351,17 @@ struct bcm_vk {
/* Reference-counting to handle file 

Re: [PATCH net-next 4/4] ptp: ptp_ines: use enum ptp_msg_type

2020-11-17 Thread Vladimir Oltean
On Wed, Nov 18, 2020 at 07:17:41AM +0800, kernel test robot wrote:
> >> drivers/ptp/ptp_ines.c:690:26: error: conflicting types for 
> >> 'tag_to_msgtype'
>  690 | static enum ptp_msg_type tag_to_msgtype(u8 tag)
>  |  ^~
>drivers/ptp/ptp_ines.c:178:11: note: previous declaration of 
> 'tag_to_msgtype' was here
>  178 | static u8 tag_to_msgtype(u8 tag);
>  |   ^~

Wait for the patches to simmer a little bit more before resending. And
please make sure to create a cover letter when doing so.
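
For reference, the robot's error is the stale forward declaration at
drivers/ptp/ptp_ines.c:178 disagreeing with the new definition; a sketch of
the obvious fix for the resend (an assumption, not taken from this thread):

-static u8 tag_to_msgtype(u8 tag);
+static enum ptp_msg_type tag_to_msgtype(u8 tag);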


Re: [PATCH v5 1/2] iommu/iova: Retry from last rb tree node if iova search fails

2020-11-17 Thread Will Deacon
On Wed, 30 Sep 2020 13:14:23 +0530, vji...@codeaurora.org wrote:
> When ever a new iova alloc request comes iova is always searched
> from the cached node and the nodes which are previous to cached
> node. So, even if there is free iova space available in the nodes
> which are next to the cached node iova allocation can still fail
> because of this approach.
> 
> Consider the following sequence of iova alloc and frees on
> 1GB of iova space
> 
> [...]

Applied to arm64 (for-next/iommu/iova), thanks!

[1/2] iommu/iova: Retry from last rb tree node if iova search fails
  https://git.kernel.org/arm64/c/4e89dce72521
[2/2] iommu/iova: Free global iova rcache on iova alloc failure
  https://git.kernel.org/arm64/c/6fa3525b455a

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


[PATCH v7 08/13] misc: bcm-vk: add get_card_info, peerlog_info, and proc_mon_info

2020-11-17 Thread Scott Branden
Add support to get card_info (details about card),
peerlog_info (to get details of peerlog on card),
and proc_mon_info (process monitoring on card).

This info is used for collection of logs via direct
read of BAR space and by sysfs access (in a follow on commit).

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h |  60 ++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 105 +++
 2 files changed, 165 insertions(+)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index 726aab71bb6b..50f2a0cd6e13 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -205,6 +205,21 @@ enum pci_barno {
 
 #define BCM_VK_NUM_TTY 2
 
+/* VK device max power state, supports 3, full, reduced and low */
+#define MAX_OPP 3
+#define MAX_CARD_INFO_TAG_SIZE 64
+
+struct bcm_vk_card_info {
+   u32 version;
+   char os_tag[MAX_CARD_INFO_TAG_SIZE];
+   char cmpt_tag[MAX_CARD_INFO_TAG_SIZE];
+   u32 cpu_freq_mhz;
+   u32 cpu_scale[MAX_OPP];
+   u32 ddr_freq_mhz;
+   u32 ddr_size_MB;
+   u32 video_core_freq_mhz;
+};
+
 /* DAUTH related info */
 struct bcm_vk_dauth_key {
char store[VK_BAR1_DAUTH_STORE_SIZE];
@@ -215,10 +230,49 @@ struct bcm_vk_dauth_info {
struct bcm_vk_dauth_key keys[VK_BAR1_DAUTH_MAX];
 };
 
+/*
+ * Control structure of logging messages from the card.  This
+ * buffer is for logmsg that comes from vk
+ */
+struct bcm_vk_peer_log {
+   u32 rd_idx;
+   u32 wr_idx;
+   u32 buf_size;
+   u32 mask;
+   char data[0];
+};
+
+/* max buf size allowed */
+#define BCM_VK_PEER_LOG_BUF_MAX SZ_16K
+/* max size per line of peer log */
+#define BCM_VK_PEER_LOG_LINE_MAX  256
+
+/*
+ * single entry for processing type + utilization
+ */
+#define BCM_VK_PROC_TYPE_TAG_LEN 8
+struct bcm_vk_proc_mon_entry_t {
+   char tag[BCM_VK_PROC_TYPE_TAG_LEN];
+   u32 used;
+   u32 max; /**< max capacity */
+};
+
+/**
+ * Structure for run time utilization
+ */
+#define BCM_VK_PROC_MON_MAX 8 /* max entries supported */
+struct bcm_vk_proc_mon_info {
+   u32 num; /**< no of entries */
+   u32 entry_size; /**< per entry size */
+   struct bcm_vk_proc_mon_entry_t entries[BCM_VK_PROC_MON_MAX];
+};
+
 struct bcm_vk {
struct pci_dev *pdev;
void __iomem *bar[MAX_BAR];
 
+   struct bcm_vk_card_info card_info;
+   struct bcm_vk_proc_mon_info proc_mon_info;
struct bcm_vk_dauth_info dauth_info;
 
/* mutex to protect the ioctls */
@@ -240,6 +294,12 @@ struct bcm_vk {
dma_addr_t tdma_addr; /* test dma segment bus addr */
 
struct notifier_block panic_nb;
+
+   /* offset of the peer log control in BAR2 */
+   u32 peerlog_off;
+   struct bcm_vk_peer_log peerlog_info; /* record of peer log info */
+   /* offset of processing monitoring info in BAR2 */
+   u32 proc_mon_off;
 };
 
 /* wq offload work items bits definitions */
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index 203a1cf2bae3..a63208513c64 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -172,6 +172,104 @@ static inline int bcm_vk_wait(struct bcm_vk *vk, enum 
pci_barno bar,
return 0;
 }
 
+static void bcm_vk_get_card_info(struct bcm_vk *vk)
+{
+   struct device *dev = &vk->pdev->dev;
+   u32 offset;
+   int i;
+   u8 *dst;
+   struct bcm_vk_card_info *info = &vk->card_info;
+
+   /* first read the offset from spare register */
+   offset = vkread32(vk, BAR_0, BAR_CARD_STATIC_INFO);
+   offset &= (pci_resource_len(vk->pdev, BAR_2 * 2) - 1);
+
+   /* based on the offset, read info to internal card info structure */
+   dst = (u8 *)info;
+   for (i = 0; i < sizeof(*info); i++)
+   *dst++ = vkread8(vk, BAR_2, offset++);
+
+#define CARD_INFO_LOG_FMT "version   : %x\n" \
+ "os_tag: %s\n" \
+ "cmpt_tag  : %s\n" \
+ "cpu_freq  : %d MHz\n" \
+ "cpu_scale : %d full, %d lowest\n" \
+ "ddr_freq  : %d MHz\n" \
+ "ddr_size  : %d MB\n" \
+ "video_freq: %d MHz\n"
+   dev_dbg(dev, CARD_INFO_LOG_FMT, info->version, info->os_tag,
+   info->cmpt_tag, info->cpu_freq_mhz, info->cpu_scale[0],
+   info->cpu_scale[MAX_OPP - 1], info->ddr_freq_mhz,
+   info->ddr_size_MB, info->video_core_freq_mhz);
+
+   /*
+* get the peer log pointer, only need the offset, and get record
+* of the log buffer information which would be used for checking
+* before dump, in case the BAR2 memory has been corrupted.
+*/
+   vk->peerlog_off = offset;
+   memcpy_fromio(&vk->peerlog_info, vk->bar[BAR_2] + vk->peerlog_off,
+ 

[PATCH v7 07/13] misc: bcm-vk: add ioctl load_image

2020-11-17 Thread Scott Branden
Add ioctl support to issue load_image operation to VK card.

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Co-developed-by: James Hu 
Signed-off-by: James Hu 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h |  3 +
 drivers/misc/bcm-vk/bcm_vk_dev.c | 95 
 2 files changed, 98 insertions(+)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index 5f0fcfdaf265..726aab71bb6b 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "bcm_vk_msg.h"
@@ -220,6 +221,8 @@ struct bcm_vk {
 
struct bcm_vk_dauth_info dauth_info;
 
+   /* mutex to protect the ioctls */
+   struct mutex mutex;
struct miscdevice miscdev;
int devid; /* dev id allocated */
 
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index 79fffb1e6f84..203a1cf2bae3 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -580,6 +581,71 @@ static void bcm_vk_wq_handler(struct work_struct *work)
}
 }
 
+static long bcm_vk_load_image(struct bcm_vk *vk,
+ const struct vk_image __user *arg)
+{
+   struct device *dev = &vk->pdev->dev;
+   const char *image_name;
+   struct vk_image image;
+   u32 next_loadable;
+   enum soc_idx idx;
+   int image_idx;
+   int ret = -EPERM;
+
+   if (copy_from_user(&image, arg, sizeof(image)))
+   return -EACCES;
+
+   if ((image.type != VK_IMAGE_TYPE_BOOT1) &&
+   (image.type != VK_IMAGE_TYPE_BOOT2)) {
+   dev_err(dev, "invalid image.type %u\n", image.type);
+   return ret;
+   }
+
+   next_loadable = bcm_vk_next_boot_image(vk);
+   if (next_loadable != image.type) {
+   dev_err(dev, "Next expected image %u, Loading %u\n",
+   next_loadable, image.type);
+   return ret;
+   }
+
+   /*
+* if something is pending download already.  This could only happen
+* for now when the driver is being loaded, or if someone has issued
+* another download command in another shell.
+*/
+   if (test_and_set_bit(BCM_VK_WQ_DWNLD_PEND, vk->wq_offload) != 0) {
+   dev_err(dev, "Download operation already pending.\n");
+   return ret;
+   }
+
+   image_name = image.filename;
+   if (image_name[0] == '\0') {
+   /* Use default image name if NULL */
+   idx = get_soc_idx(vk);
+   if (idx == VK_IDX_INVALID)
+   goto err_idx;
+
+   /* Image idx starts with boot1 */
+   image_idx = image.type - VK_IMAGE_TYPE_BOOT1;
+   image_name = get_load_fw_name(vk, &image_tab[idx][image_idx]);
+   if (!image_name) {
+   dev_err(dev, "No suitable image found for type %d",
+   image.type);
+   ret = -ENOENT;
+   goto err_idx;
+   }
+   } else {
+   /* Ensure filename is NULL terminated */
+   image.filename[sizeof(image.filename) - 1] = '\0';
+   }
+   ret = bcm_vk_load_image_by_type(vk, image.type, image_name);
+   dev_info(dev, "Load %s, ret %d\n", image_name, ret);
+err_idx:
+   clear_bit(BCM_VK_WQ_DWNLD_PEND, vk->wq_offload);
+
+   return ret;
+}
+
 static void bcm_to_v_reset_doorbell(struct bcm_vk *vk, u32 db_val)
 {
vkwrite32(vk, db_val, BAR_0, VK_BAR0_RESET_DB_BASE);
@@ -636,10 +702,38 @@ static int bcm_vk_trigger_reset(struct bcm_vk *vk)
return 0;
 }
 
+static long bcm_vk_ioctl(struct file *file, unsigned int cmd, unsigned long 
arg)
+{
+   long ret = -EINVAL;
+   struct bcm_vk_ctx *ctx = file->private_data;
+   struct bcm_vk *vk = container_of(ctx->miscdev, struct bcm_vk, miscdev);
+   void __user *argp = (void __user *)arg;
+
+   dev_dbg(&vk->pdev->dev,
+   "ioctl, cmd=0x%02x, arg=0x%02lx\n",
+   cmd, arg);
+
+   mutex_lock(&vk->mutex);
+
+   switch (cmd) {
+   case VK_IOCTL_LOAD_IMAGE:
+   ret = bcm_vk_load_image(vk, argp);
+   break;
+
+   default:
+   break;
+   }
+
+   mutex_unlock(&vk->mutex);
+
+   return ret;
+}
+
 static const struct file_operations bcm_vk_fops = {
.owner = THIS_MODULE,
.open = bcm_vk_open,
.release = bcm_vk_release,
+   .unlocked_ioctl = bcm_vk_ioctl,
 };
 
 static int bcm_vk_on_panic(struct notifier_block *nb,
@@ -670,6 +764,7 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
return -ENOMEM;
 
kref_init(&vk->kref);
+   mutex_init(&vk->mutex);
 
err = pci_enable_device(pdev);
 

Re: [PATCH] iommu/vt-d: include conditionally on CONFIG_INTEL_IOMMU_SVM

2020-11-17 Thread Will Deacon
On Sun, 15 Nov 2020 21:59:51 +0100, Lukas Bulwahn wrote:
> Commit 6ee1b77ba3ac ("iommu/vt-d: Add svm/sva invalidate function")
> introduced intel_iommu_sva_invalidate() when CONFIG_INTEL_IOMMU_SVM.
> This function uses the dedicated static variable inv_type_granu_table
> and functions to_vtd_granularity() and to_vtd_size().
> 
> These parts are unused when !CONFIG_INTEL_IOMMU_SVM, and hence,
> make CC=clang W=1 warns with an -Wunused-function warning.
> 
> [...]

Applied to arm64 (for-next/iommu/vt-d), thanks!

[1/1] iommu/vt-d: include conditionally on CONFIG_INTEL_IOMMU_SVM
  https://git.kernel.org/arm64/c/68dd9d89eaf5

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


[PATCH v7 06/13] misc: bcm-vk: add open/release

2020-11-17 Thread Scott Branden
Add open/release to replace private data with a context for other methods
to use.  The context is needed because multiple sessions are allowed to open
the sysfs interface.  For each file open, when the upper layer queries the
response, only those that are tied to the specified open should be returned.

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/Makefile |   4 +-
 drivers/misc/bcm-vk/bcm_vk.h |  15 
 drivers/misc/bcm-vk/bcm_vk_dev.c |  23 ++
 drivers/misc/bcm-vk/bcm_vk_msg.c | 127 +++
 drivers/misc/bcm-vk/bcm_vk_msg.h |  31 
 5 files changed, 199 insertions(+), 1 deletion(-)
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_msg.c
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_msg.h

diff --git a/drivers/misc/bcm-vk/Makefile b/drivers/misc/bcm-vk/Makefile
index f8a7ac4c242f..a2ae79858409 100644
--- a/drivers/misc/bcm-vk/Makefile
+++ b/drivers/misc/bcm-vk/Makefile
@@ -5,4 +5,6 @@
 
 obj-$(CONFIG_BCM_VK) += bcm_vk.o
 bcm_vk-objs := \
-   bcm_vk_dev.o
+   bcm_vk_dev.o \
+   bcm_vk_msg.o
+
diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index f428ad9a0c3d..5f0fcfdaf265 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -7,9 +7,14 @@
 #define BCM_VK_H
 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
+#include 
+
+#include "bcm_vk_msg.h"
 
 #define DRV_MODULE_NAME"bcm-vk"
 
@@ -218,6 +223,13 @@ struct bcm_vk {
struct miscdevice miscdev;
int devid; /* dev id allocated */
 
+   /* Reference-counting to handle file operations */
+   struct kref kref;
+
+   spinlock_t ctx_lock; /* Spinlock for component context */
+   struct bcm_vk_ctx ctx[VK_CMPT_CTX_MAX];
+   struct bcm_vk_ht_entry pid_ht[VK_PID_HT_SZ];
+
struct workqueue_struct *wq_thread;
struct work_struct wq_work; /* work queue for deferred job */
unsigned long wq_offload[1]; /* various flags on wq requested */
@@ -278,6 +290,9 @@ static inline bool bcm_vk_msgq_marker_valid(struct bcm_vk 
*vk)
return (rdy_marker == VK_BAR1_MSGQ_RDY_MARKER);
 }
 
+int bcm_vk_open(struct inode *inode, struct file *p_file);
+int bcm_vk_release(struct inode *inode, struct file *p_file);
+void bcm_vk_release_data(struct kref *kref);
 int bcm_vk_auto_load_all_images(struct bcm_vk *vk);
 
 #endif
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index 09d99bd36e8a..79fffb1e6f84 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -635,6 +636,12 @@ static int bcm_vk_trigger_reset(struct bcm_vk *vk)
return 0;
 }
 
+static const struct file_operations bcm_vk_fops = {
+   .owner = THIS_MODULE,
+   .open = bcm_vk_open,
+   .release = bcm_vk_release,
+};
+
 static int bcm_vk_on_panic(struct notifier_block *nb,
   unsigned long e, void *p)
 {
@@ -657,10 +664,13 @@ static int bcm_vk_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
struct miscdevice *misc_device;
u32 boot_status;
 
+   /* allocate vk structure which is tied to kref for freeing */
vk = kzalloc(sizeof(*vk), GFP_KERNEL);
if (!vk)
return -ENOMEM;
 
+	kref_init(&vk->kref);
+
err = pci_enable_device(pdev);
if (err) {
dev_err(dev, "Cannot enable PCI device\n");
@@ -738,6 +748,7 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
err = -ENOMEM;
goto err_ida_remove;
}
+	misc_device->fops = &bcm_vk_fops,
 
err = misc_register(misc_device);
if (err) {
@@ -826,6 +837,16 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
return err;
 }
 
+void bcm_vk_release_data(struct kref *kref)
+{
+   struct bcm_vk *vk = container_of(kref, struct bcm_vk, kref);
+   struct pci_dev *pdev = vk->pdev;
+
+	dev_dbg(&pdev->dev, "BCM-VK:%d release data 0x%p\n", vk->devid, vk);
+   pci_dev_put(pdev);
+   kfree(vk);
+}
+
 static void bcm_vk_remove(struct pci_dev *pdev)
 {
int i;
@@ -869,6 +890,8 @@ static void bcm_vk_remove(struct pci_dev *pdev)
pci_release_regions(pdev);
pci_free_irq_vectors(pdev);
pci_disable_device(pdev);
+
+	kref_put(&vk->kref, bcm_vk_release_data);
 }
 
 static void bcm_vk_shutdown(struct pci_dev *pdev)
diff --git a/drivers/misc/bcm-vk/bcm_vk_msg.c b/drivers/misc/bcm-vk/bcm_vk_msg.c
new file mode 100644
index ..2d9a6b4e5f61
--- /dev/null
+++ b/drivers/misc/bcm-vk/bcm_vk_msg.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018-2020 Broadcom.
+ */
+
+#include "bcm_vk.h"
+#include "bcm_vk_msg.h"
+
+/*
+ * allocate a ctx per file struct
+ */
+static struct 

Re: [PATCH] iommu: Modify the description of iommu_sva_unbind_device

2020-11-17 Thread Will Deacon
On Fri, 23 Oct 2020 06:48:27 +, Chen Jun wrote:
> iommu_sva_unbind_device has no return value.
> 
> Remove the description of the return value of the function.

Applied to arm64 (for-next/iommu/misc), thanks!

[1/1] iommu: Modify the description of iommu_sva_unbind_device
  https://git.kernel.org/arm64/c/6243f572a18d

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


[PATCH v7 05/13] misc: bcm-vk: add triggers when host panic or reboots to notify card

2020-11-17 Thread Scott Branden
Pass down an interrupt to the card in case of panic or reboot so
that the card can take appropriate action to perform a clean reset.
This uses a kernel notifier block either directly (register on the panic
list) or implicitly (add a shutdown method for the PCI device).
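
As a rough sketch of the "implicit" path (the panic notifier itself is in
the diff below; the handler body and driver struct here are illustrative,
not the exact patch):

/*
 * The PCI core calls .shutdown on an orderly reboot, so the card gets a
 * reset doorbell there as well.  The real handler may choose the doorbell
 * value based on the current boot state.
 */
static void bcm_vk_shutdown(struct pci_dev *pdev)
{
	struct bcm_vk *vk = pci_get_drvdata(pdev);

	bcm_to_v_reset_doorbell(vk, VK_BAR0_RESET_DB_HARD);
}

static struct pci_driver bcm_vk_pci_driver = {
	/* .probe / .remove as elsewhere in the series */
	.shutdown	= bcm_vk_shutdown,
};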

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h |  2 ++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 29 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index 0a366db693c8..f428ad9a0c3d 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -223,6 +223,8 @@ struct bcm_vk {
unsigned long wq_offload[1]; /* various flags on wq requested */
void *tdma_vaddr; /* test dma segment virtual addr */
dma_addr_t tdma_addr; /* test dma segment bus addr */
+
+   struct notifier_block panic_nb;
 };
 
 /* wq offload work items bits definitions */
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index 4ecd5b5f80d3..09d99bd36e8a 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -635,6 +635,16 @@ static int bcm_vk_trigger_reset(struct bcm_vk *vk)
return 0;
 }
 
+static int bcm_vk_on_panic(struct notifier_block *nb,
+  unsigned long e, void *p)
+{
+   struct bcm_vk *vk = container_of(nb, struct bcm_vk, panic_nb);
+
+   bcm_to_v_reset_doorbell(vk, VK_BAR0_RESET_DB_HARD);
+
+   return 0;
+}
+
 static int bcm_vk_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
int err;
@@ -748,6 +758,15 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
/* sync other info */
bcm_vk_sync_card_info(vk);
 
+   /* register for panic notifier */
+   vk->panic_nb.notifier_call = bcm_vk_on_panic;
+	err = atomic_notifier_chain_register(&panic_notifier_list,
+					     &vk->panic_nb);
+   if (err) {
+   dev_err(dev, "Fail to register panic notifier\n");
+   goto err_destroy_workqueue;
+   }
+
/*
 * lets trigger an auto download.  We don't want to do it serially here
 * because at probing time, it is not supposed to block for a long time.
@@ -756,7 +775,7 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (auto_load) {
if ((boot_status & BOOT_STATE_MASK) == BROM_RUNNING) {
if (bcm_vk_trigger_autoload(vk))
-   goto err_destroy_workqueue;
+   goto err_unregister_panic_notifier;
} else {
dev_err(dev,
"Auto-load skipped - BROM not in proper state 
(0x%x)\n",
@@ -768,6 +787,10 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
return 0;
 
+err_unregister_panic_notifier:
+	atomic_notifier_chain_unregister(&panic_notifier_list,
+					 &vk->panic_nb);
+
 err_destroy_workqueue:
destroy_workqueue(vk->wq_thread);
 
@@ -818,6 +841,10 @@ static void bcm_vk_remove(struct pci_dev *pdev)
bcm_vk_trigger_reset(vk);
usleep_range(BCM_VK_UCODE_BOOT_US, BCM_VK_UCODE_BOOT_MAX_US);
 
+   /* unregister panic notifier */
+	atomic_notifier_chain_unregister(&panic_notifier_list,
+					 &vk->panic_nb);
+
if (vk->tdma_vaddr)
 		dma_free_coherent(&pdev->dev, nr_scratch_pages * PAGE_SIZE,
  vk->tdma_vaddr, vk->tdma_addr);
-- 
2.17.1



[PATCH v7 04/13] misc: bcm-vk: add misc device to Broadcom VK driver

2020-11-17 Thread Scott Branden
Add misc device base support to create and remove the devnode.
Additional misc functions for open/read/write/release/ioctl/sysfs, etc.
will be added in follow-on commits to allow for individual review.
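
For orientation, the later hooks all land in one file_operations table that
is wired to the miscdevice before misc_register(); a sketch (the mmap hook
name is illustrative here, the other hooks appear in later patches of this
series):

static const struct file_operations bcm_vk_fops = {
	.owner		= THIS_MODULE,
	.open		= bcm_vk_open,		/* open/release patch */
	.release	= bcm_vk_release,
	.unlocked_ioctl	= bcm_vk_ioctl,		/* load_image ioctl patch */
	.mmap		= bcm_vk_mmap,		/* BAR2 mmap patch (name assumed) */
};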

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h |  2 ++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 36 +++-
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index c4fb61a84e41..0a366db693c8 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -7,6 +7,7 @@
 #define BCM_VK_H
 
 #include 
+#include 
 #include 
 #include 
 
@@ -214,6 +215,7 @@ struct bcm_vk {
 
struct bcm_vk_dauth_info dauth_info;
 
+   struct miscdevice miscdev;
int devid; /* dev id allocated */
 
struct workqueue_struct *wq_thread;
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index adc3103c7012..4ecd5b5f80d3 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -643,6 +644,7 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
char name[20];
struct bcm_vk *vk;
 	struct device *dev = &pdev->dev;
+   struct miscdevice *misc_device;
u32 boot_status;
 
vk = kzalloc(sizeof(*vk), GFP_KERNEL);
@@ -719,6 +721,19 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
vk->devid = id;
snprintf(name, sizeof(name), DRV_MODULE_NAME ".%d", id);
+	misc_device = &vk->miscdev;
+   misc_device->minor = MISC_DYNAMIC_MINOR;
+   misc_device->name = kstrdup(name, GFP_KERNEL);
+   if (!misc_device->name) {
+   err = -ENOMEM;
+   goto err_ida_remove;
+   }
+
+   err = misc_register(misc_device);
+   if (err) {
+   dev_err(dev, "failed to register device\n");
+   goto err_kfree_name;
+   }
 
 	INIT_WORK(&vk->wq_work, bcm_vk_wq_handler);
 
@@ -727,7 +742,7 @@ static int bcm_vk_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (!vk->wq_thread) {
dev_err(dev, "Fail to create workqueue thread\n");
err = -ENOMEM;
-   goto err_ida_remove;
+   goto err_misc_deregister;
}
 
/* sync other info */
@@ -749,11 +764,20 @@ static int bcm_vk_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
}
}
 
+   dev_dbg(dev, "BCM-VK:%u created\n", id);
+
return 0;
 
 err_destroy_workqueue:
destroy_workqueue(vk->wq_thread);
 
+err_misc_deregister:
+   misc_deregister(misc_device);
+
+err_kfree_name:
+   kfree(misc_device->name);
+   misc_device->name = NULL;
+
 err_ida_remove:
 	ida_simple_remove(&bcm_vk_ida, id);
 
@@ -783,6 +807,7 @@ static void bcm_vk_remove(struct pci_dev *pdev)
 {
int i;
struct bcm_vk *vk = pci_get_drvdata(pdev);
+	struct miscdevice *misc_device = &vk->miscdev;
 
/*
 * Trigger a reset to card and wait enough time for UCODE to rerun,
@@ -797,6 +822,13 @@ static void bcm_vk_remove(struct pci_dev *pdev)
 		dma_free_coherent(&pdev->dev, nr_scratch_pages * PAGE_SIZE,
  vk->tdma_vaddr, vk->tdma_addr);
 
+   /* remove if name is set which means misc dev registered */
+   if (misc_device->name) {
+   misc_deregister(misc_device);
+   kfree(misc_device->name);
+		ida_simple_remove(&bcm_vk_ida, vk->devid);
+   }
+
 	cancel_work_sync(&vk->wq_work);
destroy_workqueue(vk->wq_thread);
 
@@ -805,6 +837,8 @@ static void bcm_vk_remove(struct pci_dev *pdev)
pci_iounmap(pdev, vk->bar[i]);
}
 
+	dev_dbg(&pdev->dev, "BCM-VK:%d released\n", vk->devid);
+
pci_release_regions(pdev);
pci_free_irq_vectors(pdev);
pci_disable_device(pdev);
-- 
2.17.1



[PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle

2020-11-17 Thread Joel Fernandes (Google)
During force-idle, we end up doing cross-cpu comparison of vruntimes
during pick_next_task. If we simply compare (vruntime-min_vruntime)
across CPUs, and if the CPUs only have 1 task each, we will always
end up comparing 0 with 0 and pick just one of the tasks all the time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.

This resolves several performance issues that were seen in the ChromeOS
audio usecase.

NOTE: This patch will be improved in a later patch. It is just
  kept here as the basis for the later patch and to make rebasing
  easier. Further, it may make reverting the improvement easier in
  case the improvement causes any regression.
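
As a concrete, simplified sketch of the comparison this enables (the actual
change folds the logic into cfs_prio_less(); the helper name below is
illustrative):

/*
 * Normalize each vruntime against the min_vruntime snapshot taken when
 * force idle started (min_vruntime_fi) rather than the live min_vruntime.
 */
static bool core_vruntime_less(struct task_struct *a, struct task_struct *b)
{
	s64 va = (s64)(a->se.vruntime - task_rq(a)->cfs.min_vruntime_fi);
	s64 vb = (s64)(b->se.vruntime - task_rq(b)->cfs.min_vruntime_fi);

	return va < vb;
}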

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 33 -
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  5 +
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52d0e83072a4..4ee4902c2cf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-   u64 vruntime = b->se.vruntime;
-
-   /*
-* Normalize the vruntime if tasks are in different cpus.
-*/
-   if (task_cpu(a) != task_cpu(b)) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
-   }
+   if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
+   return cfs_prio_less(a, b);
 
return false;
 }
@@ -5144,6 +5133,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
+   bool fi_before = false;
bool need_sync;
int i, j, cpu;
 
@@ -5208,6 +5198,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
+   fi_before = true;
rq->core->core_forceidle = false;
}
for_each_cpu(i, smt_mask) {
@@ -5219,6 +5210,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
update_rq_clock(rq_i);
}
 
+   /* Reset the snapshot if core is no longer in force-idle. */
+   if (!fi_before) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
/*
 * Try and select tasks for each sibling in decending sched_class
 * order.
@@ -5355,6 +5354,14 @@ next_class:;
resched_curr(rq_i);
}
 
+   /* Snapshot if core is in force-idle. */
+   if (!fi_before && rq->core->core_forceidle) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
 done:
set_next_task(rq, next);
return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42965c4fd71f..de82f88ba98c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10726,6 +10726,46 @@ static inline void task_tick_core(struct rq *rq, 
struct task_struct *curr)
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   bool samecpu = task_cpu(a) == task_cpu(b);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+   struct cfs_rq *cfs_rqa;
+   struct cfs_rq *cfs_rqb;
+   s64 delta;
+
+   if (samecpu) {
+   /* vruntime is per cfs_rq */
+   while (!is_same_group(sea, seb)) {
+   int sea_depth = sea->depth;
+   int seb_depth = seb->depth;
+   if (sea_depth >= seb_depth)
+   sea = parent_entity(sea);
+   if (sea_depth <= seb_depth)
+   seb = parent_entity(seb);
+   }
+
+   delta = (s64)(sea->vruntime - 

[PATCH v7 03/13] misc: bcm-vk: add autoload support

2020-11-17 Thread Scott Branden
Add support to load and boot images on the card automatically.
The kernel module parameter auto_load can be passed in as false to disable
such support at probe time.
Likewise, nr_scratch_pages can be specified to allocate more or less scratch
memory at init time as needed for the desired card operation.
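
Both knobs are ordinary module parameters; a sketch of how they would be
declared (the defaults and permissions shown here are assumptions, not
taken from the patch):

#include <linux/module.h>

static bool auto_load = true;
module_param(auto_load, bool, 0444);
MODULE_PARM_DESC(auto_load, "Load and boot card images automatically at probe");

static uint nr_scratch_pages = 64;	/* default value is an assumption */
module_param(nr_scratch_pages, uint, 0444);
MODULE_PARM_DESC(nr_scratch_pages, "Number of scratch pages to allocate at init");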

Co-developed-by: Desmond Yan 
Signed-off-by: Desmond Yan 
Co-developed-by: James Hu 
Signed-off-by: James Hu 
Signed-off-by: Scott Branden 
---
 drivers/misc/bcm-vk/bcm_vk.h | 250 +++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 723 +++
 2 files changed, 973 insertions(+)

diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
index 9152785199ab..c4fb61a84e41 100644
--- a/drivers/misc/bcm-vk/bcm_vk.h
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -6,10 +6,187 @@
 #ifndef BCM_VK_H
 #define BCM_VK_H
 
+#include 
 #include 
+#include 
 
 #define DRV_MODULE_NAME		"bcm-vk"
 
+/*
+ * Load Image is completed in two stages:
+ *
+ * 1) When the VK device boot-up, M7 CPU runs and executes the BootROM.
+ * The Secure Boot Loader (SBL) as part of the BootROM will run
+ * to open up ITCM for host to push BOOT1 image.
+ * SBL will authenticate the image before jumping to BOOT1 image.
+ *
+ * 2) Because BOOT1 image is a secured image, we also called it the
+ * Secure Boot Image (SBI). At second stage, SBI will initialize DDR
+ * and wait for host to push BOOT2 image to DDR.
+ * SBI will authenticate the image before jumping to BOOT2 image.
+ *
+ */
+/* Location of registers of interest in BAR0 */
+
+/* Request register for Secure Boot Loader (SBL) download */
+#define BAR_CODEPUSH_SBL   0x400
+/* Start of ITCM */
+#define CODEPUSH_BOOT1_ENTRY   0x0040
+#define CODEPUSH_MASK  0xf000
+#define CODEPUSH_BOOTSTART BIT(0)
+
+/* Boot Status register */
+#define BAR_BOOT_STATUS0x404
+
+#define SRAM_OPEN  BIT(16)
+#define DDR_OPEN   BIT(17)
+
+/* Firmware loader progress status definitions */
+#define FW_LOADER_ACK_SEND_MORE_DATA   BIT(18)
+#define FW_LOADER_ACK_IN_PROGRESS  BIT(19)
+#define FW_LOADER_ACK_RCVD_ALL_DATABIT(20)
+
+/* Boot1/2 is running in standalone mode */
+#define BOOT_STDALONE_RUNNING  BIT(21)
+
+/* definitions for boot status register */
+#define BOOT_STATE_MASK(0x & \
+~(FW_LOADER_ACK_SEND_MORE_DATA | \
+  FW_LOADER_ACK_IN_PROGRESS | \
+  BOOT_STDALONE_RUNNING))
+
+#define BOOT_ERR_SHIFT 4
+#define BOOT_ERR_MASK  (0xf << BOOT_ERR_SHIFT)
+#define BOOT_PROG_MASK 0xf
+
+#define BROM_STATUS_NOT_RUN0x2
+#define BROM_NOT_RUN   (SRAM_OPEN | BROM_STATUS_NOT_RUN)
+#define BROM_STATUS_COMPLETE   0x6
+#define BROM_RUNNING   (SRAM_OPEN | BROM_STATUS_COMPLETE)
+#define BOOT1_STATUS_COMPLETE  0x6
+#define BOOT1_RUNNING  (DDR_OPEN | BOOT1_STATUS_COMPLETE)
+#define BOOT2_STATUS_COMPLETE  0x6
+#define BOOT2_RUNNING  (FW_LOADER_ACK_RCVD_ALL_DATA | \
+BOOT2_STATUS_COMPLETE)
+
+/* Boot request for Secure Boot Image (SBI) */
+#define BAR_CODEPUSH_SBI   0x408
+/* 64M mapped to BAR2 */
+#define CODEPUSH_BOOT2_ENTRY   0x6000
+
+#define BAR_CARD_STATUS0x410
+
+#define BAR_BOOT1_STDALONE_PROGRESS0x420
+#define BOOT1_STDALONE_SUCCESS (BIT(13) | BIT(14))
+#define BOOT1_STDALONE_PROGRESS_MASK   BOOT1_STDALONE_SUCCESS
+
+#define BAR_METADATA_VERSION   0x440
+#define BAR_OS_UPTIME  0x444
+#define BAR_CHIP_ID0x448
+#define MAJOR_SOC_REV(_chip_id)(((_chip_id) >> 20) & 0xf)
+
+#define BAR_CARD_TEMPERATURE   0x45c
+
+#define BAR_CARD_VOLTAGE   0x460
+
+#define BAR_CARD_ERR_LOG   0x464
+
+#define BAR_CARD_ERR_MEM   0x468
+
+#define BAR_CARD_PWR_AND_THRE  0x46c
+
+#define BAR_CARD_STATIC_INFO   0x470
+
+#define BAR_INTF_VER   0x47c
+#define BAR_INTF_VER_MAJOR_SHIFT   16
+#define BAR_INTF_VER_MASK  0x
+/*
+ * major and minor semantic version numbers supported
+ * Please update as required on interface changes
+ */
+#define SEMANTIC_MAJOR 1
+#define SEMANTIC_MINOR 0
+
+/*
+ * first door bell reg, ie for queue = 0.  Only need the first one, as
+ * we will use the queue number to derive the others
+ */
+#define VK_BAR0_REGSEG_DB_BASE 0x484
+#define VK_BAR0_REGSEG_DB_REG_GAP  8 /*
+  * DB register gap,
+  * DB1 at 0x48c and DB2 at 0x494
+  */
+
+/* reset register and specific 

[PATCH v7 02/13] misc: bcm-vk: add Broadcom VK driver

2020-11-17 Thread Scott Branden
Add initial version of the Broadcom VK driver to enumerate the PCI device
IDs of the Valkyrie and Viper devices.

VK-based cards provide real-time, high-performance, high-throughput,
low-latency offload compute engine operations.
They are used for multiple parallel offload tasks such as
audio, video and image processing and crypto operations.

Further commits add additional features to the driver beyond probe/remove.
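
For orientation, enumeration follows the usual PCI driver pattern; a sketch
(the table and driver struct are illustrative, the device IDs are the ones
defined in the patch below):

static const struct pci_device_id bcm_vk_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_VALKYRIE), },
	{ PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_VIPER), },
	{ }
};
MODULE_DEVICE_TABLE(pci, bcm_vk_ids);

static struct pci_driver bcm_vk_pci_driver = {
	.name		= DRV_MODULE_NAME,
	.id_table	= bcm_vk_ids,
	.probe		= bcm_vk_probe,
	.remove		= bcm_vk_remove,
};
module_pci_driver(bcm_vk_pci_driver);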

Signed-off-by: Scott Branden 
---
 drivers/misc/Kconfig |   1 +
 drivers/misc/Makefile|   1 +
 drivers/misc/bcm-vk/Kconfig  |  15 
 drivers/misc/bcm-vk/Makefile |   8 ++
 drivers/misc/bcm-vk/bcm_vk.h |  29 +++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 141 +++
 6 files changed, 195 insertions(+)
 create mode 100644 drivers/misc/bcm-vk/Kconfig
 create mode 100644 drivers/misc/bcm-vk/Makefile
 create mode 100644 drivers/misc/bcm-vk/bcm_vk.h
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_dev.c

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index fafa8b0d8099..591903773a6d 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -478,6 +478,7 @@ source "drivers/misc/genwqe/Kconfig"
 source "drivers/misc/echo/Kconfig"
 source "drivers/misc/cxl/Kconfig"
 source "drivers/misc/ocxl/Kconfig"
+source "drivers/misc/bcm-vk/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 source "drivers/misc/habanalabs/Kconfig"
 source "drivers/misc/uacce/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d23231e73330..54f2fe2d9448 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_ECHO)+= echo/
 obj-$(CONFIG_CXL_BASE) += cxl/
 obj-$(CONFIG_PCI_ENDPOINT_TEST)+= pci_endpoint_test.o
 obj-$(CONFIG_OCXL) += ocxl/
+obj-$(CONFIG_BCM_VK)   += bcm-vk/
 obj-y  += cardreader/
 obj-$(CONFIG_PVPANIC)  += pvpanic.o
 obj-$(CONFIG_HABANA_AI)+= habanalabs/
diff --git a/drivers/misc/bcm-vk/Kconfig b/drivers/misc/bcm-vk/Kconfig
new file mode 100644
index ..2272e47655ed
--- /dev/null
+++ b/drivers/misc/bcm-vk/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Broadcom VK device
+#
+config BCM_VK
+   tristate "Support for Broadcom VK Accelerators"
+   depends on PCI_MSI
+   help
+ Select this option to enable support for Broadcom
+ VK Accelerators.  VK is used for performing
+ specific offload processing.
+ This driver enables userspace programs to access these
+ accelerators via /dev/bcm-vk.N devices.
+
+ If unsure, say N.
diff --git a/drivers/misc/bcm-vk/Makefile b/drivers/misc/bcm-vk/Makefile
new file mode 100644
index ..f8a7ac4c242f
--- /dev/null
+++ b/drivers/misc/bcm-vk/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for Broadcom VK driver
+#
+
+obj-$(CONFIG_BCM_VK) += bcm_vk.o
+bcm_vk-objs := \
+   bcm_vk_dev.o
diff --git a/drivers/misc/bcm-vk/bcm_vk.h b/drivers/misc/bcm-vk/bcm_vk.h
new file mode 100644
index ..9152785199ab
--- /dev/null
+++ b/drivers/misc/bcm-vk/bcm_vk.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2018-2020 Broadcom.
+ */
+
+#ifndef BCM_VK_H
+#define BCM_VK_H
+
+#include 
+
+#define DRV_MODULE_NAME		"bcm-vk"
+
+/* VK device supports a maximum of 3 bars */
+#define MAX_BAR			3
+
+enum pci_barno {
+   BAR_0 = 0,
+   BAR_1,
+   BAR_2
+};
+
+#define BCM_VK_NUM_TTY 2
+
+struct bcm_vk {
+   struct pci_dev *pdev;
+   void __iomem *bar[MAX_BAR];
+};
+
+#endif
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
new file mode 100644
index ..14afe2477b97
--- /dev/null
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018-2020 Broadcom.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "bcm_vk.h"
+
+#define PCI_DEVICE_ID_VALKYRIE 0x5e87
+#define PCI_DEVICE_ID_VIPER0x5e88
+
+/* MSIX usages */
+#define VK_MSIX_MSGQ_MAX   3
+#define VK_MSIX_NOTF_MAX   1
+#define VK_MSIX_TTY_MAX		BCM_VK_NUM_TTY
+#define VK_MSIX_IRQ_MAX		(VK_MSIX_MSGQ_MAX + VK_MSIX_NOTF_MAX + \
+				 VK_MSIX_TTY_MAX)
+#define VK_MSIX_IRQ_MIN_REQ (VK_MSIX_MSGQ_MAX + VK_MSIX_NOTF_MAX)
+
+/* Number of bits set in DMA mask */
+#define BCM_VK_DMA_BITS		64
+
+static int bcm_vk_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+   int err;
+   int i;
+   int irq;
+   struct bcm_vk *vk;
+	struct device *dev = &pdev->dev;
+
+   vk = kzalloc(sizeof(*vk), GFP_KERNEL);
+   if (!vk)
+   return -ENOMEM;
+
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(dev, "Cannot enable PCI device\n");
+   

[PATCH v7 01/13] bcm-vk: add bcm_vk UAPI

2020-11-17 Thread Scott Branden
Add a user space API for the bcm-vk driver.

Provide an ioctl API to load images and issue a reset command to the card.
FW status registers in PCI BAR space are also defined as part of the API
so that user space is able to interpret these memory locations as needed
via direct PCIe access.
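
A hedged example of how user space would drive this API (device node name
per the driver's Kconfig help text; the image file name and reset arguments
are placeholders):

/* Sketch: push a stage-1 boot image to the card, then request a reset. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/misc/bcm_vk.h>

static int load_and_reset(const char *node)
{
	struct vk_image img = { .type = VK_IMAGE_TYPE_BOOT1 };
	struct vk_reset rst = { .arg1 = 0, .arg2 = 0 };
	int ret = -1;
	int fd = open(node, O_RDWR);	/* e.g. "/dev/bcm-vk.0" */

	if (fd < 0)
		return -1;
	/* image file name below is illustrative only */
	strncpy((char *)img.filename, "vk-boot1.bin", sizeof(img.filename) - 1);
	if (!ioctl(fd, VK_IOCTL_LOAD_IMAGE, &img) &&
	    !ioctl(fd, VK_IOCTL_RESET, &rst))
		ret = 0;
	close(fd);
	return ret;
}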

Signed-off-by: Scott Branden 
---
 include/uapi/linux/misc/bcm_vk.h | 84 
 1 file changed, 84 insertions(+)
 create mode 100644 include/uapi/linux/misc/bcm_vk.h

diff --git a/include/uapi/linux/misc/bcm_vk.h b/include/uapi/linux/misc/bcm_vk.h
new file mode 100644
index ..ec28e0bd46a9
--- /dev/null
+++ b/include/uapi/linux/misc/bcm_vk.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright 2018-2020 Broadcom.
+ */
+
+#ifndef __UAPI_LINUX_MISC_BCM_VK_H
+#define __UAPI_LINUX_MISC_BCM_VK_H
+
+#include 
+#include 
+
+#define BCM_VK_MAX_FILENAME 64
+
+struct vk_image {
+   __u32 type; /* Type of image */
+#define VK_IMAGE_TYPE_BOOT1 1 /* 1st stage (load to SRAM) */
+#define VK_IMAGE_TYPE_BOOT2 2 /* 2nd stage (load to DDR) */
+   __u8 filename[BCM_VK_MAX_FILENAME]; /* Filename of image */
+};
+
+struct vk_reset {
+   __u32 arg1;
+   __u32 arg2;
+};
+
+#define VK_MAGIC   0x5e
+
+/* Load image to Valkyrie */
+#define VK_IOCTL_LOAD_IMAGE	_IOW(VK_MAGIC, 0x2, struct vk_image)
+
+/* Send Reset to Valkyrie */
+#define VK_IOCTL_RESET _IOW(VK_MAGIC, 0x4, struct vk_reset)
+
+/*
+ * Firmware Status accessed directly via BAR space
+ */
+#define VK_BAR_FWSTS   0x41c
+#define VK_BAR_COP_FWSTS   0x428
+/* VK_FWSTS definitions */
+#define VK_FWSTS_RELOCATION_ENTRY  (1UL << 0)
+#define VK_FWSTS_RELOCATION_EXIT   (1UL << 1)
+#define VK_FWSTS_INIT_START(1UL << 2)
+#define VK_FWSTS_ARCH_INIT_DONE(1UL << 3)
+#define VK_FWSTS_PRE_KNL1_INIT_DONE(1UL << 4)
+#define VK_FWSTS_PRE_KNL2_INIT_DONE(1UL << 5)
+#define VK_FWSTS_POST_KNL_INIT_DONE(1UL << 6)
+#define VK_FWSTS_INIT_DONE (1UL << 7)
+#define VK_FWSTS_APP_INIT_START(1UL << 8)
+#define VK_FWSTS_APP_INIT_DONE (1UL << 9)
+#define VK_FWSTS_MASK  0x
+#define VK_FWSTS_READY (VK_FWSTS_INIT_START | \
+VK_FWSTS_ARCH_INIT_DONE | \
+VK_FWSTS_PRE_KNL1_INIT_DONE | \
+VK_FWSTS_PRE_KNL2_INIT_DONE | \
+VK_FWSTS_POST_KNL_INIT_DONE | \
+VK_FWSTS_INIT_DONE | \
+VK_FWSTS_APP_INIT_START | \
+VK_FWSTS_APP_INIT_DONE)
+/* Deinit */
+#define VK_FWSTS_APP_DEINIT_START  (1UL << 23)
+#define VK_FWSTS_APP_DEINIT_DONE   (1UL << 24)
+#define VK_FWSTS_DRV_DEINIT_START  (1UL << 25)
+#define VK_FWSTS_DRV_DEINIT_DONE   (1UL << 26)
+#define VK_FWSTS_RESET_DONE(1UL << 27)
+#define VK_FWSTS_DEINIT_TRIGGERED  (VK_FWSTS_APP_DEINIT_START | \
+VK_FWSTS_APP_DEINIT_DONE  | \
+VK_FWSTS_DRV_DEINIT_START | \
+VK_FWSTS_DRV_DEINIT_DONE)
+/* Last nibble for reboot reason */
+#define VK_FWSTS_RESET_REASON_SHIFT28
+#define VK_FWSTS_RESET_REASON_MASK (0xf << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_SYS_PWRUP   (0x0 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_MBOX_DB (0x1 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_M7_WDOG (0x2 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_TEMP(0x3 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_PCI_FLR (0x4 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_PCI_HOT (0x5 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_PCI_WARM	(0x6 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_PCI_COLD	(0x7 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_L1  (0x8 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_L0  (0x9 << VK_FWSTS_RESET_REASON_SHIFT)
+#define VK_FWSTS_RESET_UNKNOWN (0xf << VK_FWSTS_RESET_REASON_SHIFT)
+
+#endif /* __UAPI_LINUX_MISC_BCM_VK_H */
-- 
2.17.1



[PATCH v7 00/13] Add Broadcom VK driver

2020-11-17 Thread Scott Branden
This patch series drops the previous patches in [1]
that were incorporated by Kees Cook into the patch series
"Introduce partial kernel_read_file() support" [2].

The remaining patches in this series add the Broadcom VK driver
(which depends on the request_firmware_into_buf API addition that has
now been accepted into the upstream kernel as of v5.10-rc1).

[1] 
https://lore.kernel.org/lkml/20200706232309.12010-1-scott.bran...@broadcom.com/
[2] https://lore.kernel.org/lkml/20201002173828.2099543-1-keesc...@chromium.org/

Changes from v6:
 - drop QSTATS patch as it needs to be reviewed if trace_printk makes sense
 - add wdog and IPC interface alerts
 - add boundary check to msgq and peerlog
 - clear additional registers on reset
Changes from v5:
 - dropped sysfs patch from series for now as rework to use hwmon
 - tty patch still at end of series to drop if another solution available
 - updated cover letter commit to point to Kees' latest patch submission in [2]
 - specified --base with Kees' patches applied (kernel branches don't have 
these yet)
 - removed trivial comment
 - moved location of const to before the struct in two declarations
 - changed dev_info to dev_warn and only print when irq don't match expected
 - changed dev_info to dev_dbg when printing debug QSTATS
 - removed unnecessary %p print
Changes from v4:
 - fixed memory leak in probe function on failure
 - changed -1 to -EBUSY in bcm_vk_tty return code
 - move bcm_vk_tty patch to end of patch series so it
   can be dropped from current patch series if needed
   and rearchitected if needed.
Changes from v3:
 - split driver into more incremental commits for acceptance/review
 - lowered some dev_info to dev_dbg
 - remove ANSI stdint types and replace with linux u8, etc types
 - changed an EIO return to EPFNOSUPPORT
 - move vk_msg_cmd internal to driver to not expose to UAPI at this time
Changes from v2:
 - open code BIT macro in uapi header
 - A0/B0 boot improvements
Changes from v1:
 - declare bcm_vk_intf_ver_chk as static

Scott Branden (13):
  bcm-vk: add bcm_vk UAPI
  misc: bcm-vk: add Broadcom VK driver
  misc: bcm-vk: add autoload support
  misc: bcm-vk: add misc device to Broadcom VK driver
  misc: bcm-vk: add triggers when host panic or reboots to notify card
  misc: bcm-vk: add open/release
  misc: bcm-vk: add ioctl load_image
  misc: bcm-vk: add get_card_info, peerlog_info, and proc_mon_info
  misc: bcm-vk: add VK messaging support
  misc: bcm-vk: reset_pid support
  misc: bcm-vk: add mmap function for exposing BAR2
  MAINTAINERS: bcm-vk: add maintainer for Broadcom VK Driver
  misc: bcm-vk: add ttyVK support

 MAINTAINERS  |7 +
 drivers/misc/Kconfig |1 +
 drivers/misc/Makefile|1 +
 drivers/misc/bcm-vk/Kconfig  |   15 +
 drivers/misc/bcm-vk/Makefile |   12 +
 drivers/misc/bcm-vk/bcm_vk.h |  513 ++
 drivers/misc/bcm-vk/bcm_vk_dev.c | 1651 ++
 drivers/misc/bcm-vk/bcm_vk_msg.c | 1350 
 drivers/misc/bcm-vk/bcm_vk_msg.h |  163 +++
 drivers/misc/bcm-vk/bcm_vk_sg.c  |  275 +
 drivers/misc/bcm-vk/bcm_vk_sg.h  |   61 ++
 drivers/misc/bcm-vk/bcm_vk_tty.c |  333 ++
 include/uapi/linux/misc/bcm_vk.h |   84 ++
 13 files changed, 4466 insertions(+)
 create mode 100644 drivers/misc/bcm-vk/Kconfig
 create mode 100644 drivers/misc/bcm-vk/Makefile
 create mode 100644 drivers/misc/bcm-vk/bcm_vk.h
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_dev.c
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_msg.c
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_msg.h
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_sg.c
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_sg.h
 create mode 100644 drivers/misc/bcm-vk/bcm_vk_tty.c
 create mode 100644 include/uapi/linux/misc/bcm_vk.h

-- 
2.17.1



[PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case

2020-11-17 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

If there is only one long-running local task and the sibling is
forced idle, it might not get a chance to run until a schedule
event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 15 ---
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bd0b0bbb040..52d0e83072a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* reset state */
rq->core->core_cookie = 0UL;
+   if (rq->core->core_forceidle) {
+   need_sync = true;
+   rq->core->core_forceidle = false;
+   }
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
rq_i->core_pick = NULL;
 
-   if (rq_i->core_forceidle) {
-   need_sync = true;
-   rq_i->core_forceidle = false;
-   }
-
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -5335,8 +5334,10 @@ next_class:;
if (!rq_i->core_pick)
continue;
 
-   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-   rq_i->core_forceidle = true;
+   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+   !rq_i->core->core_forceidle) {
+   rq_i->core->core_forceidle = true;
+   }
 
if (i == cpu) {
rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f53681cd263e..42965c4fd71f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+   u64 slice = sched_slice(cfs_rq_of(se), se);
+   u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+   return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE  2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+   if (!sched_core_enabled(rq))
+   return;
+
+   /*
+* If runqueue has only one task which used up its slice and
+* if the sibling is forced idle, then trigger schedule to
+* give forced idle task a chance.
+*
+* sched_slice() considers only this active rq and it gets the
+* whole slice. But during force idle, we have siblings acting
+* like a single runqueue and hence we need to consider runnable
+* tasks on this cpu and the forced idle cpu. Ideally, we should
+* go through the forced idle rq, but that would be a perf hit.
+* We can assume that the forced idle cpu has atleast
+* MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+* if we need to give up the cpu.
+*/
+   if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+   resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10715,6 +10753,8 @@ static void task_tick_fair(struct rq *rq, struct 
task_struct *curr, int queued)
 
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+   task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63b28e1843ee..be656ca8693d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,12 +1069,12 @@ struct rq {
unsigned intcore_enabled;
unsigned intcore_sched_seq;
struct rb_root  core_tree;
-   unsigned char   core_forceidle;
 
/* shared state */
unsigned intcore_task_seq;
unsigned intcore_pick_seq;
unsigned long   core_cookie;
+   unsigned char   core_forceidle;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog



Re: [RESEND][PATCH] ima: Set and clear FMODE_CAN_READ in ima_calc_file_hash()

2020-11-17 Thread Mimi Zohar
On Tue, 2020-11-17 at 10:23 -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 10:35 AM Mimi Zohar  wrote:
> >
> > We need to differentiate between signed files, which by definition are
> > immutable, and those that are mutable.  Appending to a mutable file,
> > for example, would result in the file hash not being updated.
> > Subsequent reads would fail.
> 
> Why would that require any reading of the file at all AT WRITE TIME?

On the (last) file close, the file hash is re-calculated and written
out as security.ima.  The EVM hmac is re-calculated and written out as
security.evm.

> 
> Don't do it. Really.

I really wish it wasn't needed.

> 
> When opening the file write-only, you just invalidate the hash. It
> doesn't matter anyway - you're only writing.
> 
> Later on, when reading, only at that point does the hash matter, and
> then you can do the verification.
> 
> Although honestly, I don't even see the point. You know the hash won't
> match, if you wrote to the file.

On the local system, as Roberto mentioned, before updating a file, the
existing file's data and metadata (EVM) should be verified to protect
from an offline attack.

The above scenario assumes calculating the file hash is only being used
for verifying the integrity of the file (security.ima), but there are
other reasons for calculating the file hash.  For example depending on
the IMA measurement policy, just accessing a file could require
including the file hash in the measurement list.  True that measurement
will only be valid at the time of measurement, but it provides a base
value.

Mimi



[PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also after we forked a task, its core scheduler queue's presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent
a new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve the inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 183 +--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..b99a7493d590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   struct task_struct *task;
+
+   while (!sched_core_empty(rq)) {
+   task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+   }
+   rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct 
task_struct *p)
 {
rq->core->core_task_seq++;
 
-   if (!p->core_cookie)
+   if (!sched_core_enqueued(p))
return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
-   cpu_rq(cpu)->core_enabled = enabled;
+   for_each_possible_cpu(cpu) {
+   struct rq *rq = cpu_rq(cpu);
+
+   WARN_ON_ONCE(enabled == rq->core_enabled);
+
+   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) 
>= 2)) {
+   /*
+* All active and migrating tasks will have already
+* been removed from core queue when we clear the
+* cgroup tags. However, dying tasks could still be
+* left in core queue. Flush them here.
+*/
+   if (!enabled)
+   sched_core_flush(cpu);
+
+   rq->core_enabled = enabled;
+   }
+   }
 
return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-   // XXX verify there are no cookie tasks (yet)
+   int cpu;
+
+   /* verify there are no cookie tasks (yet) */
+   for_each_online_cpu(cpu)
+   BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-   // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3978,6 +4029,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #ifdef CONFIG_SMP

[PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups

2020-11-17 Thread Joel Fernandes (Google)
A previous patch improved cross-cpu vruntime comparison operations in
pick_next_task(). Improve it further for tasks in CGroups.

In particular, for cross-CPU comparisons, we were previously going to
the root-level se(s) for both tasks being compared. That was strange.
This patch instead finds the se(s) for both tasks that have the same
parent (which may be different from root).

A note about the min_vruntime snapshot and force idling:
Abbreviations: fi: force-idled now? ; fib: force-idled before?
During selection:
When we're not fi, we need to update snapshot.
When we're fi and we were not fi, we must update snapshot.
When we're fi and we were already fi, we must not update snapshot.

Which gives:
fib fi  update?
0   0   1
0   1   1
1   0   1
1   1   0
So the min_vruntime snapshot needs to be updated when: !(fib && fi).
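
In code form, the selection path would refresh the snapshots only under
that condition (sketch, reusing the fi_before naming from the earlier
snapshot patch):

	/* Refresh unless the core was force-idled before and still is. */
	if (!(fi_before && fi)) {
		for_each_cpu(i, smt_mask) {
			struct rq *rq_i = cpu_rq(i);

			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
		}
	}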

Also, the cfs_prio_less() function needs to be aware of whether the core
is in force idle or not, since it will use this information to know
whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
this information along via pick_task() -> prio_less().

Reviewed-by: Vineeth Pillai 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 61 +
 kernel/sched/fair.c  | 80 
 kernel/sched/sched.h |  7 +++-
 3 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b373b592680..20125431af87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -101,7 +101,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, 
bool in_fi)
 {
 
int pa = __task_prio(a), pb = __task_prio(b);
@@ -116,7 +116,7 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
-   return cfs_prio_less(a, b);
+   return cfs_prio_less(a, b, in_fi);
 
return false;
 }
@@ -130,7 +130,7 @@ static inline bool __sched_core_less(struct task_struct *a, 
struct task_struct *
return false;
 
/* flip prio, so high prio is leftmost */
-   if (prio_less(b, a))
+   if (prio_less(b, a, task_rq(a)->core->core_forceidle))
return true;
 
return false;
@@ -5101,7 +5101,7 @@ static inline bool cookie_match(struct task_struct *a, 
struct task_struct *b)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max, bool in_fi)
 {
struct task_struct *class_pick, *cookie_pick;
unsigned long cookie = rq->core->core_cookie;
@@ -5116,7 +5116,7 @@ pick_task(struct rq *rq, const struct sched_class *class, 
struct task_struct *ma
 * higher priority than max.
 */
if (max && class_pick->core_cookie &&
-   prio_less(class_pick, max))
+   prio_less(class_pick, max, in_fi))
return idle_sched_class.pick_task(rq);
 
return class_pick;
@@ -5135,13 +5135,15 @@ pick_task(struct rq *rq, const struct sched_class 
*class, struct task_struct *ma
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
 */
-   if (prio_less(cookie_pick, class_pick) &&
-   (!max || prio_less(max, class_pick)))
+   if (prio_less(cookie_pick, class_pick, in_fi) &&
+   (!max || prio_less(max, class_pick, in_fi)))
return class_pick;
 
return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool 
in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5230,9 +5232,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
if (!next->core_cookie) {
rq->core_pick = NULL;
+   /*
+* For robustness, update the min_vruntime_fi for
+* unconstrained picks as well.
+*/
+   WARN_ON_ONCE(fi_before);
+   task_vruntime_update(rq, next, false);
goto done;
}
-   need_sync = true;
}
 
for_each_cpu(i, smt_mask) {
@@ -5244,14 +5251,6 @@ pick_next_task(struct rq *rq, struct 

[PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit

2020-11-17 Thread Joel Fernandes (Google)
Add a generic_idle_{enter,exit} helper function to enter and exit kernel
protection when entering and exiting idle, respectively.

While at it, remove a stale RCU comment.

Reviewed-by: Alexandre Chartre 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/entry-common.h | 18 ++
 kernel/sched/idle.c  | 11 ++-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 022e1f114157..8f34ae625f83 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
&& _TIF_UNSAFE_RET != 0;
 }
+
+/**
+ * generic_idle_enter - General tasks to perform during idle entry.
+ */
+static inline void generic_idle_enter(void)
+{
+   /* Entering idle ends the protected kernel region. */
+   sched_core_unsafe_exit();
+}
+
+/**
+ * generic_idle_exit  - General tasks to perform during idle exit.
+ */
+static inline void generic_idle_exit(void)
+{
+   /* Exiting idle (re)starts the protected kernel region. */
+   sched_core_unsafe_enter();
+}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8bdb214eb78f..ee4f91396c31 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
  */
 #include "sched.h"
 
+#include 
 #include 
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
+   generic_idle_enter();
trace_cpu_idle(0, smp_processor_id());
stop_critical_timings();
rcu_idle_enter();
@@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
rcu_idle_exit();
start_critical_timings();
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+   generic_idle_exit();
 
return 1;
 }
@@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
return;
}
 
-   /*
-* The RCU framework needs to be told that we are entering an idle
-* section, so no more rcu read side critical sections and one more
-* step to the grace period
-*/
+   generic_idle_enter();
 
if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();
@@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
 */
if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();
+
+   generic_idle_exit();
 }
 
 /*
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 32/32] sched: Debug bits...

2020-11-17 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 35 ++-
 kernel/sched/fair.c |  9 +
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01938a2154fd..bbeeb18d460e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 DEFINE_STATIC_KEY_TRUE(sched_coresched_supported);
@@ -5486,6 +5494,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5580,6 +5595,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5596,6 +5614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5617,6 +5637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5658,13 +5679,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5704,6 +5733,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a89c7c917cc6..81c8a50ab4c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10798,6 +10798,15 @@ static void se_fi_update(struct sched_entity *se, 
unsigned int fi_seq, 

[PATCH -tip 31/32] sched: Add a coresched command line option

2020-11-17 Thread Joel Fernandes (Google)
Some hardware, such as certain AMD variants, doesn't have cross-HT MDS/L1TF
issues. Detect this and don't enable core scheduling, as it can
needlessly slow those devices down.

However, some users may want core scheduling even if the hardware is
secure. To support them, add a coresched= option which defaults to
'secure' and can be overridden to 'on' if the user wants to enable
coresched even if the HW is not vulnerable. 'off' would disable
core scheduling in any case.

Also add a sched_debug entry to indicate if core scheduling is turned on
or not.

Reviewed-by: Alexander Graf 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt | 14 ++
 arch/x86/kernel/cpu/bugs.c| 19 
 include/linux/cpu.h   |  1 +
 include/linux/sched/smt.h |  4 ++
 kernel/cpu.c  | 43 +++
 kernel/sched/core.c   |  6 +++
 kernel/sched/debug.c  |  4 ++
 7 files changed, 91 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b185c6ed4aba..9cd2cf7c18d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -698,6 +698,20 @@
/proc//coredump_filter.
See also Documentation/filesystems/proc.rst.
 
+   coresched=  [SCHED_CORE] This feature allows the Linux scheduler
+   to force hyperthread siblings of a CPU to only execute 
tasks
+   concurrently on all hyperthreads that are running 
within the
+   same core scheduling group.
+   Possible values are:
+   'on' - Enable scheduler capability to core schedule.
+   By default, no tasks will be core scheduled, but the 
coresched
+   interface can be used to form groups of tasks that are 
forced
+   to share a core.
+   'off' - Disable scheduler capability to core schedule.
+   'secure' - Like 'on' but only enable on systems 
affected by
+   MDS or L1TF vulnerabilities. 'off' otherwise.
+   Default: 'secure'.
+
coresight_cpu_debug.enable
[ARM,ARM64]
Format: 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index dece79e4d1e9..f3163f4a805c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -43,6 +43,7 @@ static void __init mds_select_mitigation(void);
 static void __init mds_print_mitigation(void);
 static void __init taa_select_mitigation(void);
 static void __init srbds_select_mitigation(void);
+static void __init coresched_select(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -103,6 +104,9 @@ void __init check_bugs(void)
if (boot_cpu_has(X86_FEATURE_STIBP))
x86_spec_ctrl_mask |= SPEC_CTRL_STIBP;
 
+   /* Update whether core-scheduling is needed. */
+   coresched_select();
+
/* Select the proper CPU mitigations before patching alternatives: */
spectre_v1_select_mitigation();
spectre_v2_select_mitigation();
@@ -1808,4 +1812,19 @@ ssize_t cpu_show_srbds(struct device *dev, struct 
device_attribute *attr, char *
 {
return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS);
 }
+
+/*
+ * When coresched=secure command line option is passed (default), disable core
+ * scheduling if CPU does not have MDS/L1TF vulnerability.
+ */
+static void __init coresched_select(void)
+{
+#ifdef CONFIG_SCHED_CORE
+   if (coresched_cmd_secure() &&
+   !boot_cpu_has_bug(X86_BUG_MDS) &&
+   !boot_cpu_has_bug(X86_BUG_L1TF))
+		static_branch_disable(&sched_coresched_supported);
+#endif
+}
+
 #endif
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d6428aaf67e7..d1f1e64316d6 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -228,4 +228,5 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control 
ctrlval) { return 0;
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
+extern bool coresched_cmd_secure(void);
 #endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/sched/smt.h b/include/linux/sched/smt.h
index 59d3736c454c..561064eb3268 100644
--- a/include/linux/sched/smt.h
+++ b/include/linux/sched/smt.h
@@ -17,4 +17,8 @@ static inline bool sched_smt_active(void) { return false; }
 
 void arch_smt_update(void);
 
+#ifdef CONFIG_SCHED_CORE
+extern struct static_key_true sched_coresched_supported;
+#endif
+
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa535eaa4826..f22330c3ab4c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2559,3 +2559,46 @@ bool 

[PATCH v12 0/5] Add NUMA-awareness to qspinlock

2020-11-17 Thread Alex Kogan
Minor change from v11:


Fix documentation issue, as requested by Randy Dunlap and Longman.
The rest of the series is unchanged.

Summary
---

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a primary queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the MCS lock holder checks whether the next
waiter in the primary queue (if exists) is running on the same NUMA node.
If it is not, that waiter is detached from the main queue and moved into
the tail of the secondary queue. This way, we gradually filter the primary
queue, leaving only waiters running on the same preferred NUMA node. Note
that certain prioritized waiters (e.g., in irq and nmi contexts) are
excluded from being moved to the secondary queue. We change the NUMA node
preference after a waiter at the head of the secondary queue spins for a
certain amount of time. We do that by flushing the secondary queue into
the head of the primary queue, effectively changing the preference to the
NUMA node of the waiter at the head of the secondary queue at the time of
the flush.

More details are available at https://arxiv.org/abs/1810.05600.
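
To make the filtering step concrete, below is a small user-space sketch of the
idea (this is not the kernel code; the names, such as struct waiter and
filter_queue(), are invented for illustration): waiters whose NUMA node differs
from the preferred node are spliced out of the primary queue and appended to
the tail of the secondary queue.

/*
 * User-space sketch of the CNA queue-filtering idea. NOT the kernel
 * implementation; types and names are made up for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

struct waiter {
    int numa_node;
    struct waiter *next;
};

/* Append a waiter to a singly linked queue given head/tail pointers. */
static void enqueue(struct waiter **head, struct waiter **tail, struct waiter *w)
{
    w->next = NULL;
    if (*tail)
        (*tail)->next = w;
    else
        *head = w;
    *tail = w;
}

/*
 * Walk the primary queue; waiters not running on the preferred node are
 * moved to the tail of the secondary queue.
 */
static void filter_queue(struct waiter **primary, struct waiter **secondary,
                         int preferred_node)
{
    struct waiter *p = *primary, *prim_head = NULL, *prim_tail = NULL;
    struct waiter *sec_head = *secondary, *sec_tail = NULL;

    for (struct waiter *s = sec_head; s; s = s->next)
        sec_tail = s;

    while (p) {
        struct waiter *next = p->next;

        if (p->numa_node == preferred_node)
            enqueue(&prim_head, &prim_tail, p);
        else
            enqueue(&sec_head, &sec_tail, p);
        p = next;
    }
    *primary = prim_head;
    *secondary = sec_head;
}

int main(void)
{
    struct waiter w[5] = {
        { .numa_node = 0 }, { .numa_node = 1 }, { .numa_node = 0 },
        { .numa_node = 1 }, { .numa_node = 0 },
    };
    struct waiter *primary = NULL, *secondary = NULL, *tail = NULL;

    for (int i = 0; i < 5; i++)
        enqueue(&primary, &tail, &w[i]);

    filter_queue(&primary, &secondary, 0 /* lock holder's node */);

    printf("primary:");
    for (struct waiter *p = primary; p; p = p->next)
        printf(" node%d", p->numa_node);
    printf("\nsecondary:");
    for (struct waiter *p = secondary; p; p = p->next)
        printf(" node%d", p->numa_node);
    printf("\n");
    return 0;
}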

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The 
standard deviation is also reported in (), and in general is about 3%
from the average. The 'stock' kernel is v5.10.0-rc2,
commit 4ef8451b3326, compiled in the default configuration. 
'patch-CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set; 
the speedup is calculated by dividing 'patch-CNA' by 'stock'.

#thr  stock          patch-CNA      speedup (patch-CNA/stock)
  1  2.672 (0.109)  2.679 (0.110)  1.002
  2  3.247 (0.161)  3.276 (0.136)  1.009
  4  4.376 (0.139)  4.820 (0.181)  1.101
  8  5.134 (0.137)  7.125 (0.164)  1.388
 16  5.875 (0.113)  8.903 (0.209)  1.515
 32  6.298 (0.105)  9.911 (0.254)  1.574
 36  6.439 (0.125)  9.972 (0.226)  1.549
 72  6.249 (0.109)  10.375 (0.209)  1.660
108  6.082 (0.063)  10.511 (0.190)  1.728
142  5.822 (0.058)  10.448 (0.177)  1.795

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads: 

#thr  stock          patch-CNA      speedup (patch-CNA/stock)
  1  0.508 (0.002)  0.507 (0.002)  0.999
  2  0.759 (0.016)  0.767 (0.019)  1.011
  4  1.397 (0.032)  1.411 (0.029)  1.011
  8  1.694 (0.079)  1.656 (0.120)  0.978
 16  1.867 (0.107)  1.809 (0.121)  0.969
 32  1.006 (0.056)  1.752 (0.093)  1.742
 36  0.934 (0.099)  1.724 (0.064)  1.846
 72  0.804 (0.045)  1.632 (0.073)  2.030
108  0.828 (0.036)  1.690 (0.065)  2.041
142  0.784 (0.035)  1.701 (0.074)  2.168

and will-it-scale/lock2_threads:

#thr  stock          patch-CNA      speedup (patch-CNA/stock)
  1  1.590 (0.004)  1.603 (0.005)  1.008
  2  2.802 (0.057)  2.802 (0.063)  1.000
  4  5.478 (0.144)  5.396 (0.299)  0.985
  8  4.166 (0.304)  4.131 (0.402)  0.992
 16  4.147 (0.137)  3.983 (0.173)  0.961
 32  2.492 (0.067)  3.888 (0.125)  1.560
 36  2.471 (0.094)  3.908 (0.112)  1.581
 72  1.886 (0.092)  3.926 (0.106)  2.081
108  1.883 (0.101)  3.935 (0.096)  2.089
142  1.801 (0.112)  3.907 (0.111)  2.169

Our evaluation shows that CNA also improves performance of user 
applications that have hot pthread mutexes. Those mutexes are 
blocking, and waiting threads park and unpark via the futex 
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a 
chain-specific spin lock, the contention on a user-mode mutex 
translates into contention on a kernel-level spinlock.
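
As a user-space illustration of that kind of workload (not the actual
benchmark above), the toy program below simply hammers a single pthread mutex
from several threads; under contention the waiters park and unpark through
futex(2), so the kernel-side futex hash-bucket spinlock becomes the hot lock.

/* Toy pthread-mutex contention generator (illustration only). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define NITERS   1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);   /* contended -> futex wait */
        counter++;
        pthread_mutex_unlock(&lock); /* wakes a waiter -> futex wake */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL))
            exit(1);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("counter = %ld (expected %ld)\n", counter,
           (long)NTHREADS * NITERS);
    return 0;
}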

Here are the throughput results (ops/us) for the leveldb ‘readrandom’
benchmark:

#thr  stock          patch-CNA      speedup (patch-CNA/stock)
  1  0.532 (0.013)  0.533 (0.023)  1.004
  2  0.836 (0.050)  0.843 (0.031)  1.009
  4  1.039 (0.163)  1.087 (0.151)  1.046
  8  1.095 (0.181)  1.178 (0.165)  1.076
 16  1.002 (0.144)  1.196 (0.019)  1.194
 32  0.726 (0.034)  1.163 (0.026)  1.601
 36  0.691 (0.030)  1.163 (0.020)  1.683
 72  0.627 (0.014)  1.136 (0.022)  1.812
108  0.613 (0.014)  1.143 (0.023)  1.865
142  0.610 (0.014)  1.120 (0.018)  1.838

Further comments 

[PATCH -tip 30/32] Documentation: Add core scheduling documentation

2020-11-17 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 330 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 331 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..01be28d0687a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,330 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case, though not
+always, since synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all of the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set this
+  for any descendant of the tagged group. For finer grained control, the
+  ``cpu.core_tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this reason,
+  if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+* ``cpu.core_tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.core_tag`` writable only by root, while
+``cpu.core_tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+ / \
+A   B    (These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the ``cpu.core_tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.core_tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not tracked by 

[PATCH -tip 13/32] sched: Trivial forced-newidle balancer

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
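
The stealing logic itself is in try_steal_cookie()/steal_cookie_task() below.
As a rough user-space model of the idea (all names and data here are made up),
a forced-idle CPU scans the other CPUs for a queued task whose cookie matches
the core's cookie and pulls it over:

/*
 * Toy model of forced-newidle balancing: a forced-idle CPU looks for a
 * queued task with a matching cookie on some other CPU and "steals" it.
 * All names and data are invented; this is only an illustration.
 */
#include <stdio.h>

#define NCPUS  4
#define NTASKS 3

struct toy_task {
    int id;
    unsigned long cookie;
    int queued;
};

static struct toy_task rq[NCPUS][NTASKS] = {
    /* cpu0 */ { { 1, 0xa, 1 }, { 2, 0xb, 1 }, { 3, 0xa, 1 } },
    /* cpu1 */ { { 4, 0xb, 1 }, { 5, 0xb, 1 }, { 0 } },
    /* cpu2 */ { { 6, 0xa, 1 }, { 0 } },
    /* cpu3 */ { { 0 } },
};

/* Find a queued task on @src whose cookie matches @cookie. */
static struct toy_task *find_match(int src, unsigned long cookie)
{
    for (int i = 0; i < NTASKS; i++)
        if (rq[src][i].queued && rq[src][i].cookie == cookie)
            return &rq[src][i];
    return NULL;
}

/* CPU @this is forced idle; try to pull a matching task from another CPU. */
static int steal_matching_task(int this, unsigned long core_cookie)
{
    for (int src = 0; src < NCPUS; src++) {
        struct toy_task *p;

        if (src == this)
            continue;
        p = find_match(src, core_cookie);
        if (p) {
            p->queued = 0;  /* "migrate" it to @this */
            printf("cpu%d steals task %d (cookie %#lx) from cpu%d\n",
                   this, p->id, p->cookie, src);
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    /* cpu3 is forced idle while its core runs cookie 0xb. */
    if (!steal_matching_task(3, 0xb))
        printf("nothing to steal\n");
    return 0;
}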

Acked-by: Paul E. McKenney 
Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344499ab29f2..7efce9c9d9cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12e8e6627ab3..3b373b592680 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = >core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -5134,8 +5149,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
const struct sched_class *class;
const struct cpumask *smt_mask;
bool fi_before = false;
+   int i, j, cpu, occ = 0;
bool need_sync;
-   int i, j, cpu;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -5260,6 +5275,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
if (!p)
continue;
 
+   if (!is_task_rq_idle(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -5285,6 +5303,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
}
}
@@ -5324,6 +5343,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq_i->core->core_forceidle = true;
}
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu) {
rq_i->core_pick = NULL;
continue;
@@ -5353,6 +5374,113 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, >cpus_mask))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+   break;
+
+   if (try_steal_cookie(cpu, 

[PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file

2020-11-17 Thread Joel Fernandes (Google)
core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Also make the following changes:
- Fix SWA bugs found by Chris Hyser.
- Fix refcount underrun caused by not zeroing the new task's cookie.

Tested-by: Julien Desfossez 
Reviewed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 809 +---
 kernel/sched/coretag.c | 819 +
 kernel/sched/sched.h   |  51 ++-
 4 files changed, 872 insertions(+), 808 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1d9762b571a..5ef04bdc849f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
return RB_EMPTY_ROOT(>core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-   return !RB_EMPTY_NODE(>core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct 
task_struct *p)
rb_insert_color(>core_node, >core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
__sched_core_disable();
mutex_unlock(_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -3834,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, 
struct task_struct *p)
p->capture_control = NULL;
 #endif
init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+   p->core_task_cookie = 0;
+#endif
 #ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
@@ -9118,11 +9105,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, );
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9735,787 +9717,6 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * Wrapper representing a complete cookie. The address of the cookie is used as
- * a unique identifier. Each cookie has a unique permutation of the internal
- * cookie fields.
- */
-struct sched_core_cookie {
-   unsigned long task_cookie;
-   unsigned long group_cookie;
-   unsigned long color;
-
-   struct rb_node node;
-   refcount_t refcnt;
-};
-
-/*
- * A simple wrapper around refcount. An allocated sched_core_task_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_task_cookie {
-   refcount_t refcnt;
-};
-
-/* All active sched_core_cookies */
-static struct rb_root sched_core_cookies = RB_ROOT;
-static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
-
-/*
- * Returns the following:
- * a < b  => -1
- * a == b => 0
- * a > b  => 1
- */
-static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
-const struct sched_core_cookie *b)
-{
-#define COOKIE_CMP_RETURN(field) do {  \
-   if (a->field < b->field)\
-   

[PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

pick_next_entity() is passed curr == NULL during core-scheduling. Due to
this, if the rbtree is empty, the 'left' variable is set to NULL within
the function. This can cause crashes within the function.

This is not an issue if put_prev_task() is invoked on the currently
running task before calling pick_next_entity(). However, in core
scheduling, it is possible that a sibling CPU picks for another RQ in
the core, via pick_task_fair(). This remote sibling would not get any
opportunities to do a put_prev_task().

Fix it by refactoring pick_task_fair() such that pick_next_entity() is
called with the cfs_rq->curr. This will prevent pick_next_entity() from
crashing if its rbtree is empty.

Also this fixes another possible bug where update_curr() would not be
called on the cfs_rq hierarchy if the rbtree is empty. This could affect
cross-cpu comparison of vruntime.

Suggested-by: Vineeth Remanan Pillai 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12cf068eeec8..51483a00a755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7029,15 +7029,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
do {
struct sched_entity *curr = cfs_rq->curr;
 
-   se = pick_next_entity(cfs_rq, NULL);
-
-   if (curr) {
-   if (se && curr->on_rq)
-   update_curr(cfs_rq);
+   if (curr && curr->on_rq)
+   update_curr(cfs_rq);
 
-   if (!se || entity_before(curr, se))
-   se = curr;
-   }
+   se = pick_next_entity(cfs_rq, curr);
 
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
-- 
2.29.2.299.gdc1121823c-goog



[PATCH v12 2/5] locking/qspinlock: Refactor the qspinlock slow path

2020-11-17 Thread Alex Kogan
Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.
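
The hooks are plain macros rather than function pointers, so an alternative
slow path (such as the CNA variant added later in this series) can substitute
its own helpers at compile time with no indirect-call overhead. A stand-alone
sketch of that pattern, with made-up names:

/* Compile-time hook substitution, sketched with made-up names. */
#include <stdio.h>

static inline int __try_clear_tail(int val)
{
    return val == 0;        /* "default" behaviour */
}
#define try_clear_tail  __try_clear_tail

static void slowpath(int val)
{
    printf("default slowpath: %d\n", try_clear_tail(val));
}

/* A specialised flavour redefines the hook and gets its own copy. */
#undef try_clear_tail
static inline int alt_try_clear_tail(int val)
{
    return val <= 1;        /* "alternative" behaviour */
}
#define try_clear_tail  alt_try_clear_tail

static void alt_slowpath(int val)
{
    printf("alternative slowpath: %d\n", try_clear_tail(val));
}

int main(void)
{
    slowpath(1);     /* prints 0 */
    alt_slowpath(1); /* prints 1 */
    return 0;
}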

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 kernel/locking/qspinlock.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 435d696..e351870 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -289,6 +289,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
 #define queued_spin_lock_slowpath  native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+u32 val,
+struct mcs_spinlock *node)
+{
+   return atomic_try_cmpxchg_relaxed(>val, , _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_lock_handoff - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
+  struct mcs_spinlock *next)
+{
+   arch_mcs_lock_handoff(>locked, 1);
+}
+
+#define try_clear_tail __try_clear_tail
+#define mcs_lock_handoff   __mcs_lock_handoff
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -533,7 +561,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *   PENDING will make the uncontended transition fail.
 */
if ((val & _Q_TAIL_MASK) == tail) {
-   if (atomic_try_cmpxchg_relaxed(>val, , _Q_LOCKED_VAL))
+   if (try_clear_tail(lock, val, node))
goto release; /* No contention */
}
 
@@ -550,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
if (!next)
next = smp_cond_load_relaxed(>next, (VAL));
 
-   arch_mcs_lock_handoff(>locked, 1);
+   mcs_lock_handoff(node, next);
pv_kick_node(lock, next);
 
 release:
@@ -575,6 +603,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail __try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff   __mcs_lock_handoff
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath  __pv_queued_spin_lock_slowpath
 
-- 
2.7.4



[PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years that adds a second tag which
is writable by unprivileged users.

Google uses DAC controls so that the 'tag' can be set only by root, while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a tagged parent cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these subcgroups
are created by AppEngine, they are not tracked by borglet (the root daemon),
so borglet won't have a chance to set a color for them. That's where the
'color' file comes from. The color can be set by AppEngine and, once set, the
normal tasks within the subcgroup would not be able to overwrite it. This is
enforced by promoting the permission of the color file in cgroupfs.
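
For illustration only, a small user-space helper that applies this two-level
scheme through the files added in this series could look roughly like the
sketch below; the cgroup mount point and the group paths are assumptions for
the example, and error handling is minimal.

/*
 * Illustrative helper: tag a cgroup and color one of its children using
 * the cpu.core_tag / cpu.core_tag_color files added in this series.
 * The paths below are assumptions made for the example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    return close(fd);
}

int main(void)
{
    /* Assumed layout: cpu controller mounted at /sys/fs/cgroup/cpu. */
    const char *parent = "/sys/fs/cgroup/cpu/A";
    const char *child  = "/sys/fs/cgroup/cpu/A/C";
    char path[256];

    /* Root daemon: tag the container (requires privilege). */
    snprintf(path, sizeof(path), "%s/cpu.core_tag", parent);
    if (write_str(path, "1"))
        return 1;

    /* AppEngine: give one child its own color within the tag. */
    snprintf(path, sizeof(path), "%s/cpu.core_tag_color", child);
    if (write_str(path, "42"))
        return 1;

    return 0;
}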

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++---
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
unsigned long   core_cookie;
unsigned long   core_task_cookie;
unsigned long   core_group_cookie;
+   unsigned long   core_color;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 struct sched_core_cookie {
unsigned long task_cookie;
unsigned long group_cookie;
+   unsigned long color;
 
struct rb_node node;
refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct 
sched_core_cookie *a,
 
COOKIE_CMP_RETURN(task_cookie);
COOKIE_CMP_RETURN(group_cookie);
+   COOKIE_CMP_RETURN(color);
 
/* all cookie fields match */
return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct 
sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
struct sched_core_cookie temp = {
.task_cookie= p->core_task_cookie,
.group_cookie   = p->core_group_cookie,
+   .color  = p->core_color
};
const bool is_zero_cookie =
(sched_core_cookie_cmp(, _cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
 
match->task_cookie = temp.task_cookie;
match->group_cookie = temp.group_cookie;
+   match->color = temp.color;
refcount_set(>refcnt, 1);
 
rb_link_node(>node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
case sched_core_group_cookie_type:
p->core_group_cookie = cookie;
break;
+   case sched_core_color_type:
+   p->core_color = cookie;
+   break;
default:
WARN_ON_ONCE(1);
}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
+void 

[PATCH -tip 28/32] kselftest: Add tests for core-sched interface

2020-11-17 Thread Joel Fernandes (Google)
Add a kselftest to ensure that the core-sched interface is working
correctly.

Tested-by: Julien Desfossez 
Reviewed-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 818 ++
 4 files changed, 834 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..70ed2758fe23
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,818 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+}
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+   sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+char path[50] = {}, *val;
+int fd;
+
+sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+fd = open(path, O_RDONLY, 0666);
+if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+val = calloc(1, 50);
+if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+}
+
+val[strcspn(val, "\r\n")] = 0;
+
+close(fd);
+return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+}
+
+if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+   abort();
+}
+
+if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+}
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+if (read(tfd, rdbuf, 8) == -1) {
+   perror("Failed to read group color\n");
+   abort();
+}
+
+if 

[PATCH v12 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA

2020-11-17 Thread Alex Kogan
Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 kernel/locking/qspinlock_cna.h | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index d3e2754..ac3109a 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include 
+#include 
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -35,7 +36,8 @@
  * running on the same NUMA node. If it is not, that waiter is detached from 
the
  * main queue and moved into the tail of the secondary queue. This way, we
  * gradually filter the primary queue, leaving only waiters running on the same
- * preferred NUMA node.
+ * preferred NUMA node. Note that certain prioritized waiters (e.g., in
+ * irq and nmi contexts) are excluded from being moved to the secondary queue.
  *
  * We change the NUMA node preference after a waiter at the head of the
  * secondary queue spins for a certain amount of time (10ms, by default).
@@ -49,6 +51,8 @@
  *  Dave Dice 
  */
 
+#define CNA_PRIORITY_NODE  0x
+
 struct cna_node {
struct mcs_spinlock mcs;
u16 numa_node;
@@ -121,9 +125,10 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+   bool priority = !in_task() || irqs_disabled() || rt_task(current);
struct cna_node *cn = (struct cna_node *)node;
 
-   cn->numa_node = cn->real_numa_node;
+   cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
cn->start_time = 0;
 }
 
@@ -262,11 +267,13 @@ static u32 cna_order_queue(struct mcs_spinlock *node)
next_numa_node = ((struct cna_node *)next)->numa_node;
 
if (next_numa_node != numa_node) {
-   struct mcs_spinlock *nnext = READ_ONCE(next->next);
+   if (next_numa_node != CNA_PRIORITY_NODE) {
+   struct mcs_spinlock *nnext = READ_ONCE(next->next);
 
-   if (nnext) {
-   cna_splice_next(node, next, nnext);
-   next = nnext;
+   if (nnext) {
+   cna_splice_next(node, next, nnext);
+   next = nnext;
+   }
}
/*
 * Inherit NUMA node id of primary queue, to maintain the
@@ -285,6 +292,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct 
qspinlock *lock,
 
if (!cn->start_time || !intra_node_threshold_reached(cn)) {
/*
+* We are at the head of the wait queue, no need to use
+* the fake NUMA node ID.
+*/
+   if (cn->numa_node == CNA_PRIORITY_NODE)
+   cn->numa_node = cn->real_numa_node;
+
+   /*
 * Try and put the time otherwise spent spin waiting on
 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
 */
-- 
2.7.4



[PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG

2020-11-17 Thread Joel Fernandes (Google)
This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.

Reviewed-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f17ec8e993e..f1d9762b571a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10277,6 +10277,21 @@ static u64 cpu_core_tag_color_read_u64(struct 
cgroup_subsys_state *css, struct c
return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, 
struct cftype *cft)
+{
+   unsigned long group_cookie, color;
+
+   cpu_core_get_group_cookie_and_color(css_tg(css), _cookie, );
+
+   /*
+* Combine group_cookie and color into a single 64 bit value, for
+* display purposes only.
+*/
+   return (group_cookie << 32) | (color & 0x);
+}
+#endif
+
 struct write_core_tag {
struct cgroup_subsys_state *css;
unsigned long cookie;
@@ -10550,6 +10565,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -10737,6 +10760,14 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 25/32] sched: Refactor core cookie into struct

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

The overall core cookie is currently a single unsigned long value. This
poses issues as we seek to add additional sub-fields to the cookie. This
patch refactors the core_cookie to be a pointer to a struct containing
an arbitrary set of cookie fields.

We maintain a sorted RB tree of existing core cookies so that multiple
tasks may share the same core_cookie.

This will be especially useful in the next patch, where the concept of
cookie color is introduced.
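
A stripped-down user-space sketch of the lookup-or-create pattern used here
(an ordered search keyed by the cookie fields, plus refcounting so identical
cookies are shared). A linked list stands in for the RB tree, there is no
locking, and all names are invented:

/*
 * Sketch of "find an existing cookie with identical fields, else allocate
 * one". A linked list stands in for the kernel's RB tree; illustration only.
 */
#include <stdio.h>
#include <stdlib.h>

struct cookie {
    unsigned long task_cookie;
    unsigned long group_cookie;
    int refcnt;
    struct cookie *next;
};

static struct cookie *cookies;

static int cookie_cmp(const struct cookie *a, const struct cookie *b)
{
    if (a->task_cookie != b->task_cookie)
        return a->task_cookie < b->task_cookie ? -1 : 1;
    if (a->group_cookie != b->group_cookie)
        return a->group_cookie < b->group_cookie ? -1 : 1;
    return 0;
}

static struct cookie *cookie_find_get(unsigned long task, unsigned long group)
{
    struct cookie key = { .task_cookie = task, .group_cookie = group };
    struct cookie *c;

    for (c = cookies; c; c = c->next) {
        if (!cookie_cmp(c, &key)) {
            c->refcnt++;            /* share the existing cookie */
            return c;
        }
    }

    c = calloc(1, sizeof(*c));
    c->task_cookie = task;
    c->group_cookie = group;
    c->refcnt = 1;
    c->next = cookies;
    cookies = c;
    return c;
}

int main(void)
{
    struct cookie *a = cookie_find_get(1, 7);
    struct cookie *b = cookie_find_get(1, 7);   /* same fields */
    struct cookie *c = cookie_find_get(2, 7);   /* different */

    printf("a==b: %d, a==c: %d, a->refcnt: %d\n",
           a == b, a == c, a->refcnt);
    return 0;
}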

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 481 +--
 kernel/sched/sched.h |  11 +-
 2 files changed, 429 insertions(+), 63 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc36c384364e..bd75b3d62a97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3958,6 +3958,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
unsigned long flags;
+   int __maybe_unused ret;
 
__sched_fork(clone_flags, p);
/*
@@ -4037,20 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(>core_node);
 
-   /*
-* If parent is tagged via per-task cookie, tag the child (either with
-* the parent's cookie, or a new one). The final cookie is calculated
-* by concatenating the per-task cookie with that of the CGroup's.
-*/
-   if (current->core_task_cookie) {
-
-   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
-   if (!(clone_flags & CLONE_THREAD)) {
-   return sched_core_share_tasks(p, p);
-   }
-   /* Otherwise share the parent's per-task tag. */
-   return sched_core_share_tasks(p, current);
-   }
+   ret = sched_core_fork(p, clone_flags);
+   if (ret)
+   return ret;
 #endif
return 0;
 }
@@ -9059,6 +9049,9 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(_group_lock, flags);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+  unsigned long *group_cookie_ptr);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9073,11 +9066,7 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-   tsk->core_cookie = 0UL;
-
-   if (tg->tagged /* && !tsk->core_cookie ? */)
-   tsk->core_cookie = (unsigned long)tg;
+   sched_core_change_group(tsk, tg);
 #endif
 
tsk->sched_task_group = tg;
@@ -9177,9 +9166,9 @@ static void cpu_cgroup_css_offline(struct 
cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
struct task_group *tg = css_tg(css);
 
-   if (tg->tagged) {
+   if (tg->core_tagged) {
sched_core_put();
-   tg->tagged = 0;
+   tg->core_tagged = 0;
}
 #endif
 }
@@ -9751,38 +9740,225 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 
 #ifdef CONFIG_SCHED_CORE
 /*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
  */
 struct sched_core_cookie {
+   unsigned long task_cookie;
+   unsigned long group_cookie;
+
+   struct rb_node node;
refcount_t refcnt;
 };
 
 /*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+   refcount_t refcnt;
+};
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {  \
+   if (a->field < b->field)\
+   return -1;  \
+   else if (a->field > b->field)   \
+   return 1;   \
+} while (0)\
+
+   COOKIE_CMP_RETURN(task_cookie);
+   COOKIE_CMP_RETURN(group_cookie);
+

[PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-11-17 Thread Joel Fernandes (Google)
Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
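
For example, a task could use the proposed interface roughly as follows
(PR_SCHED_CORE_SHARE is the value introduced by this patch; the target pid is
just a placeholder):

/*
 * Example use of the proposed prctl(2) interface. PR_SCHED_CORE_SHARE is the
 * value introduced by this patch; the target pid below is made up.
 */
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE 59
#endif

int main(void)
{
    pid_t target = 1234;    /* task we want to share a core with */

    /* Request that the current task and 'target' share a core. */
    if (prctl(PR_SCHED_CORE_SHARE, target, 0, 0, 0) == -1) {
        perror("prctl(PR_SCHED_CORE_SHARE)");
        return 1;
    }

    /*
     * A pid of 0 resets the current task's cookie (this needs privilege
     * if a cookie was already set).
     */
    if (prctl(PR_SCHED_CORE_SHARE, 0, 0, 0, 0) == -1)
        perror("prctl(PR_SCHED_CORE_SHARE, 0)");

    return 0;
}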

Tested-by: Julien Desfossez 
Reviewed-by: Aubrey Li 
Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  1 +
 include/uapi/linux/prctl.h   |  3 ++
 kernel/sched/core.c  | 51 +---
 kernel/sys.c |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6a3b0fa952b..79d76c78cc8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,7 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ccca355623a..a95898c75bdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -4037,8 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
RB_CLEAR_NODE(>core_node);
 
/*
-* Tag child via per-task cookie only if parent is tagged via per-task
-* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+* If parent is tagged via per-task cookie, tag the child (either with
+* the parent's cookie, or a new one). The final cookie is calculated
+* by concatenating the per-task cookie with that of the CGroup's.
 */
if (current->core_task_cookie) {
 
@@ -9855,7 +9857,7 @@ static int sched_core_share_tasks(struct task_struct *t1, 
struct task_struct *t2
unsigned long cookie;
int ret = -ENOMEM;
 
-   mutex_lock(_core_mutex);
+   mutex_lock(_core_tasks_mutex);
 
/*
 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9954,10 +9956,51 @@ static int sched_core_share_tasks(struct task_struct 
*t1, struct task_struct *t2
 
ret = 0;
 out_unlock:
-   mutex_unlock(_core_mutex);
+   mutex_unlock(_core_tasks_mutex);
return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+   struct task_struct *task;
+   int err;
+
+   if (pid == 0) { /* Reset current task's cookie. */
+   /* Resetting a cookie requires privileges. */
+   if (current->core_task_cookie)
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   task = NULL;
+   } else {
+   rcu_read_lock();
+   task = pid ? find_task_by_vpid(pid) : current;
+   if (!task) {
+   rcu_read_unlock();
+   return -ESRCH;
+   }
+
+   get_task_struct(task);
+
+   /*
+* Check if this process has the right to modify the specified
+* process. Use the regular "ptrace_may_access()" checks.
+*/
+   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+   rcu_read_unlock();
+   err = -EPERM;
+   goto out_put;
+   }
+   rcu_read_unlock();
+   }
+
+   err = sched_core_share_tasks(current, task);
+out_put:
+   if (task)
+   put_task_struct(task);
+   return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..61a3c98e36de 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, 
unsigned long, arg3,
 
error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;

[PATCH -tip 24/32] sched: Release references to the per-task cookie on exit

2020-11-17 Thread Joel Fernandes (Google)
During exit, we have to free the references to a cookie that might be shared by
many tasks. This commit therefore ensures that when the task_struct is released,
references to cookies that it holds are also released.

Reviewed-by: Chris Hyser 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h | 3 +++
 kernel/fork.c | 1 +
 kernel/sched/core.c   | 8 
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 79d76c78cc8e..6fbdb1a204bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,11 +2084,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a95898c75bdf..cc36c384364e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10066,6 +10066,14 @@ static int cpu_core_tag_write_u64(struct 
cgroup_subsys_state *css, struct cftype
 
return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+   if (!tsk->core_task_cookie)
+   return;
+   sched_core_put_task_cookie(tsk->core_task_cookie);
+   sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-11-17 Thread Joel Fernandes (Google)
In order to prevent interference and clearly support both per-task and CGroup
APIs, split the cookie into two and allow it to be set from either the per-task
or the CGroup API. The final cookie is the combined value of both and is computed when
the stop-machine executes during a change of cookie.

Also, for the per-task cookie, it would get weird if we used pointers to any
ephemeral objects. For this reason, introduce a refcounted object whose sole
purpose is to assign a unique cookie value by way of the object's pointer.

While at it, refactor the CGroup code a bit. Future patches will introduce more
APIs and support.
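
The combined cookie is built by packing the per-task cookie into the upper
half of the word and the group cookie into the lower half (see
sched_core_tag_requeue() below). A tiny stand-alone demonstration of that
packing, for illustration only:

/*
 * Stand-alone demo of the cookie packing used in this patch: the task
 * cookie goes in the upper half of the word, the group cookie in the lower.
 */
#include <stdio.h>

int main(void)
{
    unsigned long task_cookie  = 0xabcdUL;
    unsigned long group_cookie = 0x1234UL;
    unsigned long core_cookie;

    core_cookie = (task_cookie << (sizeof(unsigned long) * 4)) +
                  group_cookie;

    printf("task=%#lx group=%#lx -> core=%#lx\n",
           task_cookie, group_cookie, core_cookie);
    /* On a 64-bit machine this prints core=0xabcd00001234 */
    return 0;
}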

Reviewed-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 241 --
 kernel/sched/debug.c  |   4 +
 3 files changed, 236 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a60868165590..c6a3b0fa952b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b99a7493d590..7ccca355623a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -346,11 +346,14 @@ void sched_core_put(void)
mutex_unlock(_core_mutex);
 }
 
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -4032,6 +4035,20 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(>core_node);
+
+   /*
+* Tag child via per-task cookie only if parent is tagged via per-task
+* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+*/
+   if (current->core_task_cookie) {
+
+   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
+   if (!(clone_flags & CLONE_THREAD)) {
+   return sched_core_share_tasks(p, p);
+   }
+   /* Otherwise share the parent's per-task tag. */
+   return sched_core_share_tasks(p, current);
+   }
 #endif
return 0;
 }
@@ -9731,6 +9748,217 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_SCHED_CORE
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+   refcount_t refcnt;
+};
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: is it a group interface or a per-task interface.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool 
group)
+{
+   if (!p)
+   return;
+
+   if (group)
+   p->core_group_cookie = cookie;
+   else
+   p->core_task_cookie = cookie;
+
+   /* Use up half of the cookie's bits for task cookie and remaining for 
group cookie. */
+   p->core_cookie = (p->core_task_cookie <<
+   (sizeof(unsigned long) * 4)) + 
p->core_group_cookie;
+
+   if (sched_core_enqueued(p)) {
+   sched_core_dequeue(task_rq(p), p);
+   if (!p->core_task_cookie)
+   return;
+   }
+
+   if (sched_core_enabled(task_rq(p)) &&
+   p->core_cookie && task_on_rq_queued(p))
+   sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+   struct sched_core_cookie *ptr =
+   kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+   if (!ptr)
+   return 0;
+   refcount_set(>refcnt, 1);
+
+   /*
+* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+* is done after the stopper runs.
+*/
+   sched_core_get();
+   return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+   struct sched_core_cookie *ptr = (struct 

[PATCH -tip 20/32] entry/kvm: Protect the kernel when entering from guest

2020-11-17 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

Similar to how user to kernel mode transitions are protected in earlier
patches, protect the entry into kernel from guest mode as well.

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Reviewed-by: Alexandre Chartre 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/kvm/x86.c|  2 ++
 include/linux/entry-kvm.h | 12 
 kernel/entry/kvm.c| 33 +
 3 files changed, 47 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447edc0d1d5a..a50be74f70f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8910,6 +8910,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 */
smp_mb__after_srcu_read_unlock();
 
+   kvm_exit_to_guest_mode();
/*
 * This handles the case where a posted interrupt was
 * notified with kvm_vcpu_kick.
@@ -9003,6 +9004,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}
 
+   kvm_enter_from_guest_mode();
local_irq_enable();
preempt_enable();
 
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 9b93f8584ff7..67da6dcf442b 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from 
guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void);
+
 #endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..3b603e8bd5da 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -50,3 +50,36 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from 
guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void)
+{
+   if (!entry_kernel_protected())
+   return;
+   sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void)
+{
+   if (!entry_kernel_protected())
+   return;
+   sched_core_unsafe_exit();
+
+   /*
+* Wait here instead of in xfer_to_guest_mode_handle_work(). The reason
+* is because in vcpu_run(), xfer_to_guest_mode_handle_work() is called
+* when a vCPU was either runnable or blocked. However, we only care
+* about the runnable case (VM entry/exit) which is handled by
+* vcpu_enter_guest().
+*/
+   sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode

2020-11-17 Thread Joel Fernandes (Google)
Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
when not needed; for example, idle and non-cookie HTs do not need to be
forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this since the execution of attacker and victims
happen concurrently on 2 or more HTs.

Reviewed-by: Alexandre Chartre 
Tested-by: Julien Desfossez 
Cc: Julien Desfossez 
Cc: Tim Chen 
Cc: Aaron Lu 
Cc: Aubrey Li 
Cc: Tim Chen 
Cc: Paul E. McKenney 
Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt |  11 +
 include/linux/entry-common.h  |  12 +-
 include/linux/sched.h |  12 +
 kernel/entry/common.c |  28 +-
 kernel/sched/core.c   | 241 ++
 kernel/sched/sched.h  |   3 +
 6 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bd1a5b87a5e2..b185c6ed4aba 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,17 @@
 
sbni=   [NET] Granch SBNI12 leased line adapter
 
+   sched_core_protect_kernel=
+   [SCHED_CORE] Pause SMT siblings of a core running in
+   user mode, if at least one of the siblings of the core
+   is running in kernel mode. This is to guarantee that
+   kernel data is not leaked to tasks which are not trusted
+   by the kernel. A value of 0 disables protection, 1
+   enables protection. The default is 1. Note that 
protection
+   depends on the arch defining the _TIF_UNSAFE_RET flag.
+   Further, for protecting VMEXIT, arch needs to call
+   KVM entry/exit hooks.
+
sched_debug [KNL] Enables verbose scheduler debug messages.
 
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..022e1f114157 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET   (0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE   (0)
 #endif
@@ -74,7 +78,7 @@
 #define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |  \
-ARCH_EXIT_TO_USER_MODE_WORK)
+_TIF_UNSAFE_RET | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
@@ -444,4 +448,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs 
*regs);
  */
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t 
irq_state);
 
+/* entry_kernel_protected - Is kernel protection on entry/exit into kernel 
supported? */
+static inline bool entry_kernel_protected(void)
+{
+   return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
+   && _TIF_UNSAFE_RET != 0;
+}
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..a60868165590 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2076,4 +2076,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c

[PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks

2020-11-17 Thread Joel Fernandes (Google)
Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by the series to determine if waiting is
needed or not, during exit to user or guest mode.

Tested-by: Julien Desfossez 
Reviewed-by: Aubrey Li 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/include/asm/thread_info.h | 2 ++
 kernel/sched/sched.h   | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 93277a8d2ef0..ae4f6196e38c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -99,6 +99,7 @@ struct thread_info {
 #define TIF_SPEC_FORCE_UPDATE  23  /* Force speculation MSR update in 
context switch */
 #define TIF_FORCED_TF  24  /* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP  25  /* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET 26  /* On return to process/guest, perform 
safety checks. */
 #define TIF_LAZY_MMU_UPDATES   27  /* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
@@ -127,6 +128,7 @@ struct thread_info {
 #define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
 #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET(1 << TIF_UNSAFE_RET)
 #define _TIF_LAZY_MMU_UPDATES  (1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c258ab64052..615092cb693c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2851,3 +2851,9 @@ static inline bool is_per_cpu_kthread(struct task_struct 
*p)
 
 void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 16/32] irq_work: Cleanup

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

(Needed for irq_work_is_busy() API in core-scheduling series).
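For context, the point of the irq_work_is_pending()/irq_work_is_busy() helpers
is that callers test the flags through accessors instead of relying on the
structure layout. A rough user-space model of those flag checks (bit values
and type names are illustrative only):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WORK_PENDING (1 << 0)           /* illustrative flag bits */
#define WORK_BUSY    (1 << 1)

struct work_model { atomic_int a_flags; };

static bool work_is_pending(struct work_model *w)
{
        return atomic_load(&w->a_flags) & WORK_PENDING;
}

static bool work_is_busy(struct work_model *w)
{
        return atomic_load(&w->a_flags) & WORK_BUSY;
}

int main(void)
{
        struct work_model w;

        atomic_init(&w.a_flags, WORK_PENDING | WORK_BUSY);
        printf("pending=%d busy=%d\n", work_is_pending(&w), work_is_busy(&w));
        return 0;
}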

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 drivers/gpu/drm/i915/i915_request.c |  4 ++--
 include/linux/irq_work.h| 33 ++---
 include/linux/irqflags.h|  4 ++--
 kernel/bpf/stackmap.c   |  2 +-
 kernel/irq_work.c   | 18 
 kernel/printk/printk.c  |  6 ++
 kernel/rcu/tree.c   |  3 +--
 kernel/time/tick-sched.c|  6 ++
 kernel/trace/bpf_trace.c|  2 +-
 9 files changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0e813819b041..5385b081a376 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -197,7 +197,7 @@ __notify_execute_cb(struct i915_request *rq, bool 
(*fn)(struct irq_work *wrk))
 
llist_for_each_entry_safe(cb, cn,
  llist_del_all(&rq->execute_cb),
- work.llnode)
+ work.node.llist)
fn(>work);
 }
 
@@ -460,7 +460,7 @@ __await_execution(struct i915_request *rq,
 * callback first, then checking the ACTIVE bit, we serialise with
 * the completed/retired request.
 */
-   if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+   if (llist_add(&cb->work.node.llist, &signal->execute_cb)) {
if (i915_request_is_active(signal) ||
__request_in_flight(signal))
__notify_execute_cb_imm(signal);
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..ec2a47a81e42 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -14,28 +14,37 @@
  */
 
 struct irq_work {
-   union {
-   struct __call_single_node node;
-   struct {
-   struct llist_node llnode;
-   atomic_t flags;
-   };
-   };
+   struct __call_single_node node;
void (*func)(struct irq_work *);
 };
 
+#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){  \
+   .node = { .u_flags = (_flags), },   \
+   .func = (_func),\
+}
+
+#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
+#define IRQ_WORK_INIT_LAZY(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
+#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)
+
+#define DEFINE_IRQ_WORK(name, _f)  \
+   struct irq_work name = IRQ_WORK_INIT(_f)
+
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 {
-   atomic_set(&work->flags, 0);
-   work->func = func;
+   *work = IRQ_WORK_INIT(func);
 }
 
-#define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { \
-   .flags = ATOMIC_INIT(0),\
-   .func  = (_f)   \
+static inline bool irq_work_is_pending(struct irq_work *work)
+{
+   return atomic_read(&work->node.a_flags) & IRQ_WORK_PENDING;
 }
 
+static inline bool irq_work_is_busy(struct irq_work *work)
+{
+   return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
+}
 
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 3ed4e8771b64..fef2d43a7a1d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -109,12 +109,12 @@ do {  \
 
 # define lockdep_irq_work_enter(__work)
\
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & 
IRQ_WORK_HARD_IRQ))\
current->irq_config = 1;\
  } while (0)
 # define lockdep_irq_work_exit(__work) \
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & 
IRQ_WORK_HARD_IRQ))\
current->irq_config = 0;\
  } while (0)
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 06065fa27124..599041cd0c8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -298,7 +298,7 @@ static void stack_map_get_build_id_offset(struct 
bpf_stack_build_id *id_offs,
if (irqs_disabled()) {
   

[PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-17 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balancing tries to move a task from the busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
 In the fast path of task wakeup, select the first cookie-matched
 idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches the task's cookie.

 - Don't migrate a task if the cookies do not match
 For NUMA load balancing, don't migrate a task to a CPU whose
 core cookie does not match the task's cookie.
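All four changes boil down to one rule: a CPU is a valid wakeup or migration
target only if core scheduling is disabled there or its core cookie equals the
task's cookie. A minimal user-space sketch of that rule (illustrative types;
the real helper added to kernel/sched/sched.h is sched_core_cookie_match(),
which additionally treats a fully idle core as a match):

#include <stdbool.h>
#include <stdio.h>

struct task { unsigned long core_cookie; };
struct core { bool sched_core_enabled; unsigned long core_cookie; };

static bool cookie_match(const struct core *dst, const struct task *p)
{
        if (!dst->sched_core_enabled)   /* no constraint without core sched */
                return true;
        return dst->core_cookie == p->core_cookie;
}

int main(void)
{
        struct core dst = { .sched_core_enabled = true, .core_cookie = 42 };
        struct task a = { .core_cookie = 42 };
        struct task b = { .core_cookie = 7 };

        printf("a allowed: %d\n", cookie_match(&dst, &a));      /* 1 */
        printf("b allowed: %d\n", cookie_match(&dst, &b));      /* 0 */
        return 0;
}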

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 64 
 kernel/sched/sched.h | 29 
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..ceb3906c9a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+#endif
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+#endif
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   break;
+
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+   /*
+* If Core Scheduling is enabled, select this cpu
+* only if the process cookie matches core cookie.
+*/
+   if (sched_core_enabled(cpu_rq(cpu)) &&
+   p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+   break;
+   }
}
 
time = cpu_clock(this) - time;
@@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   if (sched_core_enabled(cpu_rq(this_cpu))) {
+   int i = 0;
+   bool cookie_match = 

[PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case

2020-11-17 Thread Joel Fernandes (Google)
The core pick loop grew a lot of warts over time to support
optimizations. It turns out that directly doing a class pick before
entering the core-wide pick is better for readability. Make the changes.

Since this is a relatively new patch, make it a separate patch so that
it is easier to revert in case anyone reports an issue with it. Testing
shows it to be working for me.
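In effect the rewritten loop becomes: do an ordinary class pick first, and only
fall into the core-wide machinery when a cookie is in play. A rough user-space
sketch of that control flow (heavily simplified, not the patch itself):

#include <stdbool.h>
#include <stdio.h>

struct task { unsigned long core_cookie; };

static struct task *class_pick(void)           /* stands in for for_each_class() */
{
        static struct task t;                   /* an uncookied task */
        return &t;
}

/* Returns true when the local pick is final and no core-wide pick is needed. */
static bool try_fast_path(unsigned long core_cookie, struct task **next)
{
        bool need_sync = core_cookie != 0;      /* cookied task on this core? */

        if (need_sync)
                return false;

        *next = class_pick();
        return (*next)->core_cookie == 0;       /* a cookied pick forces the slow path */
}

int main(void)
{
        struct task *next = NULL;

        printf("fast path taken: %d\n", try_fast_path(0, &next));
        return 0;
}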

Reviewed-by: Vineeth Pillai 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 73 -
 1 file changed, 26 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6aa76de55ef2..12e8e6627ab3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5180,6 +5180,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
put_prev_task_balance(rq, prev, rf);
 
smt_mask = cpu_smt_mask(cpu);
+   need_sync = !!rq->core->core_cookie;
+
+   /* reset state */
+   rq->core->core_cookie = 0UL;
+   if (rq->core->core_forceidle) {
+   need_sync = true;
+   fi_before = true;
+   rq->core->core_forceidle = false;
+   }
 
/*
 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -5192,16 +5201,25 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * 'Fix' this by also increasing @task_seq for every pick.
 */
rq->core->core_task_seq++;
-   need_sync = !!rq->core->core_cookie;
 
-   /* reset state */
-reset:
-   rq->core->core_cookie = 0UL;
-   if (rq->core->core_forceidle) {
+   /*
+* Optimize for common case where this CPU has no cookies
+* and there are no cookied tasks running on siblings.
+*/
+   if (!need_sync) {
+   for_each_class(class) {
+   next = class->pick_task(rq);
+   if (next)
+   break;
+   }
+
+   if (!next->core_cookie) {
+   rq->core_pick = NULL;
+   goto done;
+   }
need_sync = true;
-   fi_before = true;
-   rq->core->core_forceidle = false;
}
+
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
@@ -5239,38 +5257,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * core.
 */
p = pick_task(rq_i, class, max);
-   if (!p) {
-   /*
-* If there weren't no cookies; we don't need to
-* bother with the other siblings.
-*/
-   if (i == cpu && !need_sync)
-   goto next_class;
-
+   if (!p)
continue;
-   }
-
-   /*
-* Optimize the 'normal' case where there aren't any
-* cookies and we don't need to sync up.
-*/
-   if (i == cpu && !need_sync) {
-   if (p->core_cookie) {
-   /*
-* This optimization is only valid as
-* long as there are no cookies
-* involved. We may have skipped
-* non-empty higher priority classes on
-* siblings, which are empty on this
-* CPU, so start over.
-*/
-   need_sync = true;
-   goto reset;
-   }
-
-   next = p;
-   goto done;
-   }
 
rq_i->core_pick = p;
 
@@ -5298,18 +5286,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
cpu_rq(j)->core_pick = NULL;
}
goto again;
-   } else {
-   /*
-* Once we select a task for a cpu, we
-* should not be doing an unconstrained
-* pick because it might starve a task
-* on a forced idle cpu.
-*/
-   need_sync = true;
  

Re: [PATCH] interconnect: qcom: msm8916: Remove rpm-ids from non-RPM nodes

2020-11-17 Thread Mike Tipton

On 11/12/2020 2:51 AM, Georgi Djakov wrote:

Some nodes are incorrectly marked as RPM-controlled (they have RPM
master and slave ids assigned), but are actually controlled by the
application CPU instead. The RPM complains when we send requests for
resources that it can't control. Let's fix this by replacing the IDs,
with the default "-1" in which case no requests are sent.

Signed-off-by: Georgi Djakov 


Reviewed-by: Mike Tipton 



[PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs) to
see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.

rq0rq1
CFS1 (tagged)  RT1 (not tag)
CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if() block
and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0 rq1
CFS1IDLE

When it should have selected:
rq0 r1
IDLERT

Joel saw this issue in real-world use cases on ChromeOS where an RT task
gets constantly force-idled and breaks RT. Let's cure it.

NOTE: This problem will be fixed differently in a later patch. It is just
  kept here for reference purposes about this issue, and to make
  applying later patches easier.

Reported-by: Joel Fernandes (Google) 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ee4902c2cf5..53af817740c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
need_sync = !!rq->core->core_cookie;
 
/* reset state */
+reset:
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
@@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
/*
 * If there weren't no cookies; we don't need to
 * bother with the other siblings.
-* If the rest of the core is not running a 
tagged
-* task, i.e.  need_sync == 0, and the current 
CPU
-* which called into the schedule() loop does 
not
-* have any tasks for this class, skip 
selecting for
-* other siblings since there's no point. We 
don't skip
-* for RT/DL because that could make CFS 
force-idle RT.
 */
-   if (i == cpu && !need_sync && class == 
_sched_class)
+   if (i == cpu && !need_sync)
goto next_class;
 
continue;
@@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * Optimize the 'normal' case where there aren't any
 * cookies and we don't need to sync up.
 */
-   if (i == cpu && !need_sync && !p->core_cookie) {
+   if (i == cpu && !need_sync) {
+   if (p->core_cookie) {
+   /*
+* This optimization is only valid as
+* long as there are no cookies
+* involved. We may have skipped
+* non-empty higher priority classes on
+* siblings, which are empty on this
+* CPU, so start over.
+*/
+   need_sync = true;
+   goto reset;
+   }
+
next = p;
goto done;
}
@@ -5299,7 +5307,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
need_sync = true;
}
-
}
}
 next_class:;
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 11/32] sched: Enqueue task into core queue only after vruntime is updated

2020-11-17 Thread Joel Fernandes (Google)
A waking task may have its vruntime adjusted. However, the code right
now puts it into the core queue without the adjustment. This means the
core queue may have a task with an incorrect vruntime, potentially one that
is far off. This may cause a task to get artificially boosted during
picking.

Fix it by enqueuing into the core queue only after the class-specific
enqueue callback has been called. This ensures that for CFS tasks, the
updated vruntime value is used when enqueuing the task into the core
rbtree.

Reviewed-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53af817740c0..6aa76de55ef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1753,9 +1753,6 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int 
flags)
 {
-   if (sched_core_enabled(rq))
-   sched_core_enqueue(rq, p);
-
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
 
@@ -1766,6 +1763,9 @@ static inline void enqueue_task(struct rq *rq, struct 
task_struct *p, int flags)
 
uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+
+   if (sched_core_enabled(rq))
+   sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int 
flags)
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 07/32] sched: Add core wide task selection and scheduling.

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).
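Stripped of the sequence counters and hotplug handling, the rule being
implemented is: the task that wins the core-wide comparison defines the cookie,
and a sibling whose best runnable task does not match it runs idle instead. A
toy sketch of that constraint for two siblings (illustrative only):

#include <stdio.h>

struct task { int prio; unsigned long cookie; const char *name; };

static const char *sibling_runs(const struct task *max, const struct task *local)
{
        return max->cookie == local->cookie ? local->name : "idle";
}

int main(void)
{
        struct task t0 = { .prio = 10, .cookie = 1, .name = "A" };   /* sibling 0 */
        struct task t1 = { .prio =  5, .cookie = 2, .name = "B" };   /* sibling 1 */
        const struct task *max = t0.prio >= t1.prio ? &t0 : &t1;     /* core-wide max */

        printf("sibling0 runs %s\n", sibling_runs(max, &t0));
        printf("sibling1 runs %s\n", sibling_runs(max, &t1));
        return 0;
}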

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Aaron Lu 
Signed-off-by: Tim Chen 
Signed-off-by: Chen Yu 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 301 ++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d521033777f..1bd0b0bbb040 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5029,7 +5029,7 @@ static void put_prev_task_balance(struct rq *rq, struct 
task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -5070,6 +5070,294 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+   return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+   return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_task_rq_idle(a) || is_task_rq_idle(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = rq->core->core_cookie;
+
+   class_pick = class->pick_task(rq);
+   if (!class_pick)
+   return NULL;
+
+   if (!cookie) {
+   /*
+* If class_pick is tagged, return it only if it has
+* higher priority than max.
+*/
+   if (max && class_pick->core_cookie &&
+   prio_less(class_pick, max))
+   return idle_sched_class.pick_task(rq);
+
+   return class_pick;
+   }
+
+   /*
+* If class_pick is idle or matches cookie, return early.
+*/
+   if (cookie_equals(class_pick, cookie))
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (prio_less(cookie_pick, class_pick) &&
+   (!max || prio_less(max, class_pick)))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   bool need_sync;
+   int i, j, cpu;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   cpu = cpu_of(rq);
+
+   /* Stopper task is switching into idle, no need core-wide selection. */
+   if (cpu_is_offline(cpu)) {
+   /*
+* Reset core_pick so that we don't enter the fastpath when
+* coming online. core_pick would already be migrated to
+* another cpu during offline.
+*/
+   rq->core_pick = NULL;
+   return __pick_next_task(rq, prev, rf);
+   }
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*
+* rq->core_pick can be NULL if no selection was made for a CPU because
+* it was either offline or went offline during a sibling's core-wide
+* selection. In this case, do a core-wide selection.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq &&
+   rq->core_pick) {
+   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+   next = 

[PATCH -tip 02/32] sched: Introduce sched_class::pick_task()

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/deadline.c  | 16 ++--
 kernel/sched/fair.c  | 32 +++-
 kernel/sched/idle.c  |  8 
 kernel/sched/rt.c| 15 +--
 kernel/sched/sched.h |  3 +++
 kernel/sched/stop_task.c | 14 --
 6 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f2ea0a3664c..abfc8b505d0d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1867,7 +1867,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct 
rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
struct sched_dl_entity *dl_se;
	struct dl_rq *dl_rq = &rq->dl;
@@ -1879,7 +1879,18 @@ static struct task_struct *pick_next_task_dl(struct rq 
*rq)
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
p = dl_task_of(dl_se);
-   set_next_task_dl(rq, p, true);
+
+   return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+   struct task_struct *p;
+
+   p = pick_task_dl(rq);
+   if (p)
+   set_next_task_dl(rq, p, true);
+
return p;
 }
 
@@ -2551,6 +2562,7 @@ DEFINE_SCHED_CLASS(dl) = {
 
 #ifdef CONFIG_SMP
.balance= balance_dl,
+   .pick_task  = pick_task_dl,
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
.set_cpus_allowed   = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52ddfec7cea6..12cf068eeec8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4459,7 +4459,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
 */
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;
 
if (se == curr) {
@@ -7017,6 +7017,35 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+   struct cfs_rq *cfs_rq = &rq->cfs;
+   struct sched_entity *se;
+
+   if (!cfs_rq->nr_running)
+   return NULL;
+
+   do {
+   struct sched_entity *curr = cfs_rq->curr;
+
+   se = pick_next_entity(cfs_rq, NULL);
+
+   if (curr) {
+   if (se && curr->on_rq)
+   update_curr(cfs_rq);
+
+   if (!se || entity_before(curr, se))
+   se = curr;
+   }
+
+   cfs_rq = group_cfs_rq(se);
+   } while (cfs_rq);
+
+   return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf)
 {
@@ -11219,6 +11248,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 #ifdef CONFIG_SMP
.balance= balance_fair,
+   .pick_task  = pick_task_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 50e128b899c4..33864193a2f9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -406,6 +406,13 @@ static void set_next_task_idle(struct rq *rq, struct 
task_struct *next, bool fir
schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+   return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
struct task_struct *next = rq->idle;
@@ -473,6 +480,7 @@ DEFINE_SCHED_CLASS(idle) = {
 
 #ifdef CONFIG_SMP
.balance= balance_idle,
+   .pick_task  = pick_task_idle,
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed   = set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a6f9d132c24f..a0e245b0c4bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct 

[PATCH -tip 06/32] sched: Basic tracking of matching tasks

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core; idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Reviewed-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++
 kernel/sched/fair.c   |  46 -
 kernel/sched/sched.h  |  55 
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..344499ab29f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
const struct sched_class*sched_class;
struct sched_entity se;
struct sched_rt_entity  rt;
+   struct sched_dl_entity  dl;
+
+#ifdef CONFIG_SCHED_CORE
+   struct rb_node  core_node;
+   unsigned long   core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
-   struct sched_dl_entity  dl;
 
 #ifdef CONFIG_UCLAMP_TASK
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d88bc9a6818..9d521033777f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -78,6 +78,141 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+   if (p->sched_class == &stop_sched_class) /* trumps deadline */
+   return -2;
+
+   if (rt_prio(p->prio)) /* includes deadline */
+   return p->prio; /* [-1, 99] */
+
+   if (p->sched_class == &idle_sched_class)
+   return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+   return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+   int pa = __task_prio(a), pb = __task_prio(b);
+
+   if (-pa < -pb)
+   return true;
+
+   if (-pb < -pa)
+   return false;
+
+   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+   return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+   u64 vruntime = b->se.vruntime;
+
+   /*
+* Normalize the vruntime if tasks are in different cpus.
+*/
+   if (task_cpu(a) != task_cpu(b)) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
+
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   }
+
+   return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
+{
+   if (a->core_cookie < b->core_cookie)
+   return true;
+
+   if (a->core_cookie > b->core_cookie)
+   return false;
+
+   /* flip prio, so high prio is leftmost */
+   if (prio_less(b, a))
+   return true;
+
+   return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+   struct rb_node *parent, **node;
+   struct task_struct *node_task;
+
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   node = &rq->core_tree.rb_node;
+   parent = *node;
+
+   while (*node) {
+   node_task = container_of(*node, struct task_struct, core_node);
+   parent = *node;
+
+   if (__sched_core_less(p, node_task))
+   node = &parent->rb_left;
+   else
+   node = &parent->rb_right;
+   }
+
+   rb_link_node(&p->core_node, parent, node);
+   rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching 

[PATCH -tip 04/32] sched: Core-wide rq->lock

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Introduce the basic infrastructure to have a core-wide rq->lock.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/Kconfig.preempt |   5 ++
 kernel/sched/core.c| 108 +
 kernel/sched/sched.h   |  31 
 3 files changed, 144 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..6d8be4630bd6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,8 @@ config PREEMPT_COUNT
 config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+   bool "Core Scheduling for SMT"
+   default y
+   depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db5cc05a68bc..6d88bc9a6818 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,70 @@ unsigned int sysctl_sched_rt_period = 100;
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+   bool enabled = !!(unsigned long)data;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   cpu_rq(cpu)->core_enabled = enabled;
+
+   return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+   // XXX verify there are no cookie tasks (yet)
+
+   static_branch_enable(&__sched_core_enabled);
+   stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+   // XXX verify there are no cookie tasks (left)
+
+   stop_machine(__sched_core_stopper, (void *)false, NULL);
+   static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!sched_core_count++)
+   __sched_core_enable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!--sched_core_count)
+   __sched_core_disable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -4859,6 +4923,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+   const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+   struct rq *rq, *core_rq = NULL;
+   int i;
+
+   core_rq = cpu_rq(cpu)->core;
+
+   if (!core_rq) {
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+   if (rq->core && rq->core == rq)
+   core_rq = rq;
+   }
+
+   if (!core_rq)
+   core_rq = cpu_rq(cpu);
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+
+   WARN_ON_ONCE(rq->core && rq->core != core_rq);
+   rq->core = core_rq;
+   }
+   }
+
+   printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -7484,6 +7584,9 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+
+   sched_core_cpu_starting(cpu);
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -7747,6 +7850,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
hrtick_rq_init(rq);
	atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+   rq->core_enabled = 0;
+#endif
}
 
	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a0dd2b312aa..0dfccf988998 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,6 +1061,12 @@ struct rq {
 #endif
unsigned intpush_busy;
struct cpu_stop_workpush_work;
+
+#ifdef CONFIG_SCHED_CORE
+   /* per rq */
+   struct rq   *core;
+   unsigned intcore_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1099,11 +1105,36 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 #endif

[PATCH -tip 05/32] sched/fair: Add a few assertions

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51483a00a755..ca35bfc0a368 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6245,6 +6245,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
task_util = uclamp_task_util(p);
}
 
+   /*
+* per-cpu select_idle_mask usage
+*/
+   lockdep_assert_irqs_disabled();
+
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
asym_fits_capacity(task_util, target))
return target;
@@ -6710,8 +6715,6 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
@@ -6724,6 +6727,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int wake_flags)
/* SD_flags and WF_flags share the first nibble */
int sd_flag = wake_flags & 0xF;
 
+   /*
+* required for stable ->cpus_allowed
+*/
+   lockdep_assert_held(>pi_lock);
if (wake_flags & WF_TTWU) {
record_wakee(p);
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 01/32] sched: Wrap rq::lock access

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

In preparation for playing games with rq->lock, abstract the thing
using an accessor.
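The accessor itself lives in kernel/sched/sched.h (that hunk is cut off in the
quote below); at this point in the series it is a trivial indirection. A sketch
of its shape, for readers skimming the call-site conversions:

/* sketch: until core scheduling arrives, this simply returns the rq's own lock */
static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
        return &rq->lock;
}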

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c |  68 -
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  22 
 kernel/sched/debug.c|   4 +-
 kernel/sched/fair.c |  38 +++---
 kernel/sched/idle.c |   4 +-
 kernel/sched/pelt.h |   2 +-
 kernel/sched/rt.c   |  16 +++---
 kernel/sched/sched.h| 108 +---
 kernel/sched/topology.c |   4 +-
 10 files changed, 141 insertions(+), 137 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6aaf9fb3400..db5cc05a68bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
 
for (;;) {
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
 
while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -210,7 +210,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
for (;;) {
	raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
/*
 *  move_queued_task()  task_rq_lock()
 *
@@ -232,7 +232,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
while (unlikely(task_on_rq_migrating(p)))
@@ -302,7 +302,7 @@ void update_rq_clock(struct rq *rq)
 {
s64 delta;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -611,7 +611,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (test_tsk_need_resched(curr))
return;
@@ -635,10 +635,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
 
-   raw_spin_lock_irqsave(&rq->lock, flags);
+   raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
-   raw_spin_unlock_irqrestore(&rq->lock, flags);
+   raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1137,7 +1137,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1177,7 +1177,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
/*
 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1807,7 +1807,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
   struct task_struct *p, int new_cpu)
 {
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
@@ -1973,7 +1973,7 @@ int push_cpu_stop(void *arg)
struct task_struct *p = arg;
 
	raw_spin_lock_irq(&p->pi_lock);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
 
if (task_rq(p) != rq)
goto out_unlock;
@@ -2003,7 +2003,7 @@ int push_cpu_stop(void *arg)
 
 out_unlock:
rq->push_busy = false;
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
	raw_spin_unlock_irq(&p->pi_lock);
 
put_task_struct(p);
@@ -2056,7 +2056,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct 
cpumask *new_mask, u32
 * Because __kthread_bind() calls this on 

[PATCH -tip 00/32] Core scheduling (v9)

2020-11-17 Thread Joel Fernandes (Google)
Core-Scheduling
===
Enclosed is series v9 of core scheduling.
v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'")).
I hope that this version is acceptable to be merged (pending any new review
comments that arise) as the main issues in the past are all resolved:
 1. Vruntime comparison.
 2. Documentation updates.
 3. CGroup and per-task interface developed by Google and Oracle.
 4. Hotplug fixes.
Almost all patches also have Reviewed-by or Acked-by tags. See below for the
full list of changes in v9.

Introduction of feature
===
Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). Core scheduling (as of v7) mitigates
user-space to user-space attacks and user-to-kernel attacks when one of
the siblings enters the kernel via interrupts or system calls.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.

Both a CGroup and Per-task interface via prctl(2) are provided for configuring
core sharing. More details are provided in documentation patch.  Kselftests are
provided to verify the correctness/rules of the interface.

Testing
===
ChromeOS testing shows 300% improvement in keypress latency on a Google
docs key press with Google hangout test (the maximum latency drops from 150ms
to 50ms for keypresses).

Julien: TPCC tests showed improvements with core-scheduling as below. With 
kernel
protection enabled, it does not show any regression. Possibly ASI will improve
the performance for those who choose kernel protection (can be toggled through
sched_core_protect_kernel sysctl).
                                average     stdev        diff
baseline (SMT on)               1197.272    44.78312824
core sched (   kernel protect)  412.9895    45.42734343  -65.51%
core sched (no kernel protect)  686.6515    71.77756931  -42.65%
nosmt                           408.667     39.39042872  -65.87%
(Note these results are from v8).

Vineeth tested sysbench and does not see any regressions.
Hong and Aubrey tested v9 and see results similar to v8. There is a known issue
with uperf that does regress. This appears to be because of ksoftirq heavily
contending with other tasks on the core. The consensus is this can be improved
in the future.

Changes in v9
=
- Note that the vruntime snapshot change is written in 2 patches to show the
  progression of the idea and prevent merge conflicts:
sched/fair: Snapshot the min_vruntime of CPUs on force idle
sched: Improve snapshotting of min_vruntime for CGroups
  Same with the RT priority inversion change:
sched: Fix priority inversion of cookied task with sibling
sched: Improve snapshotting of min_vruntime for CGroups
- Disable coresched on certain AMD HW.

Changes in v8
=
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu 

Re: possible deadlock in kill_fasync

2020-11-17 Thread syzbot
syzbot has bisected this issue to:

commit e918188611f073063415f40fae568fa4d86d9044
Author: Boqun Feng 
Date:   Fri Aug 7 07:42:20 2020 +

locking: More accurate annotations for read_lock()

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10b4642950
start commit:   7c8ca812 Add linux-next specific files for 20201117
git tree:   linux-next
final oops: https://syzkaller.appspot.com/x/report.txt?x=12b4642950
console output: https://syzkaller.appspot.com/x/log.txt?x=14b4642950
kernel config:  https://syzkaller.appspot.com/x/.config?x=ff4bc71371dc5b13
dashboard link: https://syzkaller.appspot.com/bug?extid=3e12e14ee01b675e1af2
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14b1dba650
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12de863650

Reported-by: syzbot+3e12e14ee01b675e1...@syzkaller.appspotmail.com
Fixes: e918188611f0 ("locking: More accurate annotations for read_lock()")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection


Re: [PATCH net-next 4/4] ptp: ptp_ines: use enum ptp_msg_type

2020-11-17 Thread kernel test robot
Hi Christian,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Christian-Eggers/net-ptp-introduce-enum-ptp_msg_type/20201118-033828
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
72308ecbf33b145641aba61071be31a85ebfd92c
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/0day-ci/linux/commit/c4e4cfcabe3201e2ece90cc9025894e4ed08f069
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Christian-Eggers/net-ptp-introduce-enum-ptp_msg_type/20201118-033828
git checkout c4e4cfcabe3201e2ece90cc9025894e4ed08f069
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All error/warnings (new ones prefixed by >>):

>> drivers/ptp/ptp_ines.c:690:26: error: conflicting types for 'tag_to_msgtype'
 690 | static enum ptp_msg_type tag_to_msgtype(u8 tag)
 |  ^~
   drivers/ptp/ptp_ines.c:178:11: note: previous declaration of 
'tag_to_msgtype' was here
 178 | static u8 tag_to_msgtype(u8 tag);
 |   ^~
>> drivers/ptp/ptp_ines.c:178:11: warning: 'tag_to_msgtype' used but never 
>> defined
   drivers/ptp/ptp_ines.c:690:26: warning: 'tag_to_msgtype' defined but not 
used [-Wunused-function]
 690 | static enum ptp_msg_type tag_to_msgtype(u8 tag)
 |  ^~

vim +/tag_to_msgtype +690 drivers/ptp/ptp_ines.c

   689  
 > 690  static enum ptp_msg_type tag_to_msgtype(u8 tag)
   691  {
   692  switch (tag) {
   693  case MESSAGE_TYPE_SYNC:
   694  return PTP_MSGTYPE_SYNC;
   695  case MESSAGE_TYPE_P_DELAY_REQ:
   696  return PTP_MSGTYPE_PDELAY_REQ;
   697  case MESSAGE_TYPE_P_DELAY_RESP:
   698  return PTP_MSGTYPE_PDELAY_RESP;
   699  case MESSAGE_TYPE_DELAY_REQ:
   700  return PTP_MSGTYPE_DELAY_REQ;
   701  }
   702  return 0xf;
   703  }
   704  
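One way to resolve the conflict (a sketch; it assumes the enum return type is
the intended one) is to make the earlier forward declaration match the
definition:

/* drivers/ptp/ptp_ines.c:178 - declaration must match the later definition */
static enum ptp_msg_type tag_to_msgtype(u8 tag);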

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org




Re: Ping(3): [PATCH v4] : Add nitems()

2020-11-17 Thread Joseph Myers
On Tue, 17 Nov 2020, Alejandro Colomar via Libc-alpha wrote:

> Nice!
> Please update me on any feedback you receive.

Apparently the author is planning new versions of those papers so WG14 
discussion is waiting for those.

> So glibc will basically hold this patch
> at least until the WG answers to that, right?

I think that whether C2x gets an array-size feature of some kind is 
relevant to whether such a feature goes in glibc and what it looks like in 
glibc, but the fact that it will be considered in WG14 doesn't rule out 
glibc considering such a feature without waiting for WG14.
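For reference, the construct under discussion is essentially the classic
element-count macro; a minimal sketch (the exact name, header and any
type-safety checking are precisely what is still being decided):

#include <stdio.h>

#define nitems(arr) (sizeof(arr) / sizeof((arr)[0]))    /* sketch only */

int main(void)
{
        int a[42];

        printf("%zu\n", nitems(a));                     /* prints 42 */
        return 0;
}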

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH] interconnect: qcom: qcs404: Remove gpu and display nodes

2020-11-17 Thread Mike Tipton

On 11/11/2020 2:07 AM, Georgi Djakov wrote:

The following errors are noticed during boot on a QCS404 board:
[2.926647] qcom_icc_rpm_smd_send mas 6 error -6
[2.934573] qcom_icc_rpm_smd_send mas 8 error -6

These errors show when we try to configure the GPU and display nodes,
which are defined in the topology, but these hardware blocks actually
do not exist on QCS404. According to the datasheet, GPU and display
are only present on QCS405 and QCS407.


Even on QCS405/407 where GPU and display are present, you'd still get 
these errors since these particular nodes aren't supported on RPM and 
are purely local. Instead of removing these we should just change their 
mas_rpm_id to -1. It's harmless to leave them in for QCS404 since 
they're only used for path aggregation. The same code can support all 
variants of the QCS400 series. We just wouldn't expect anyone to 
actually vote these paths on QCS404. Similar to how the gcc-qcs404 clock 
provider still registers the GPU and MDP clocks.
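Concretely, that suggestion (a sketch based on the macro invocations quoted in
the patch below, not a tested change) would keep the nodes and only drop their
RPM master ids:

/* keep the GPU/MDP nodes, but make them purely local (mas_rpm_id = -1) */
DEFINE_QNODE(mas_oxili, QCS404_MASTER_GRAPHICS_3D, 8, -1, -1, QCS404_SLAVE_EBI_CH0, QCS404_BIMC_SNOC_SLV);
DEFINE_QNODE(mas_mdp, QCS404_MASTER_MDP_PORT0, 8, -1, -1, QCS404_SLAVE_EBI_CH0, QCS404_BIMC_SNOC_SLV);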




Signed-off-by: Georgi Djakov 
---
  drivers/interconnect/qcom/qcs404.c | 9 +++--
  1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/interconnect/qcom/qcs404.c 
b/drivers/interconnect/qcom/qcs404.c
index 9f992422e92f..2ed544e23ff3 100644
--- a/drivers/interconnect/qcom/qcs404.c
+++ b/drivers/interconnect/qcom/qcs404.c
@@ -20,8 +20,6 @@
  
  enum {

QCS404_MASTER_AMPSS_M0 = 1,
-   QCS404_MASTER_GRAPHICS_3D,
-   QCS404_MASTER_MDP_PORT0,
QCS404_SNOC_BIMC_1_MAS,
QCS404_MASTER_TCU_0,
QCS404_MASTER_SPDM,
@@ -156,8 +154,6 @@ struct qcom_icc_desc {
}
  
  DEFINE_QNODE(mas_apps_proc, QCS404_MASTER_AMPSS_M0, 8, 0, -1, QCS404_SLAVE_EBI_CH0, QCS404_BIMC_SNOC_SLV);

-DEFINE_QNODE(mas_oxili, QCS404_MASTER_GRAPHICS_3D, 8, 6, -1, 
QCS404_SLAVE_EBI_CH0, QCS404_BIMC_SNOC_SLV);
-DEFINE_QNODE(mas_mdp, QCS404_MASTER_MDP_PORT0, 8, 8, -1, QCS404_SLAVE_EBI_CH0, QCS404_BIMC_SNOC_SLV);
 DEFINE_QNODE(mas_snoc_bimc_1, QCS404_SNOC_BIMC_1_MAS, 8, 76, -1, QCS404_SLAVE_EBI_CH0);

  DEFINE_QNODE(mas_tcu_0, QCS404_MASTER_TCU_0, 8, -1, -1, QCS404_SLAVE_EBI_CH0, 
QCS404_BIMC_SNOC_SLV);
  DEFINE_QNODE(mas_spdm, QCS404_MASTER_SPDM, 4, -1, -1, QCS404_PNOC_INT_3);
@@ -231,8 +227,6 @@ DEFINE_QNODE(slv_lpass, QCS404_SLAVE_LPASS, 4, -1, -1, 0);
  
  static struct qcom_icc_node *qcs404_bimc_nodes[] = {

[MASTER_AMPSS_M0] = _apps_proc,
-   [MASTER_OXILI] = _oxili,
-   [MASTER_MDP_PORT0] = _mdp,
[MASTER_SNOC_BIMC_1] = _snoc_bimc_1,
[MASTER_TCU_0] = _tcu_0,
[SLAVE_EBI_CH0] = _ebi,
@@ -460,6 +454,9 @@ static int qnoc_probe(struct platform_device *pdev)
for (i = 0; i < num_nodes; i++) {
size_t j;
  
+		if (!qnodes[i])

+   continue;
+
node = icc_node_create(qnodes[i]->id);
if (IS_ERR(node)) {
ret = PTR_ERR(node);



[PATCH v12 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

2020-11-17 Thread Alex Kogan
In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the MCS lock
holder checks whether the next waiter in the primary queue (if exists) is
running on the same NUMA node. If it is not, that waiter is detached from
the main queue and moved into the tail of the secondary queue. This way,
we gradually filter the primary queue, leaving only waiters running on
the same preferred NUMA node. For more details, see
https://arxiv.org/abs/1810.05600.
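
A minimal sketch of that filtering step (the struct layout and helper
below are illustrative only, not the code added by this patch):

struct cna_sketch {
	struct cna_sketch *next;	/* primary (MCS) queue link */
	int numa_node;
};

/*
 * One hand-off step as described above: if the next waiter in the
 * primary queue runs on a different node, detach it and append it to
 * the tail of the secondary queue, so the lock is offered to the
 * waiter after it instead.
 */
static void cna_filter_step(struct cna_sketch *holder,
			    struct cna_sketch **sec_head,
			    struct cna_sketch **sec_tail)
{
	struct cna_sketch *next = holder->next;

	if (!next || next->numa_node == holder->numa_node)
		return;				/* next waiter is local (or absent) */

	holder->next = next->next;		/* detach the remote waiter */
	next->next = NULL;

	if (*sec_tail)				/* append to the secondary queue tail */
		(*sec_tail)->next = next;
	else
		*sec_head = next;
	*sec_tail = next;
}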

Note that this variant of CNA may introduce starvation by continuously
passing the lock between waiters in the main queue. This issue will be
addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
boot time only if we run on a multi-node machine in native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |  10 +
 arch/x86/Kconfig|  20 ++
 arch/x86/include/asm/qspinlock.h|   4 +
 arch/x86/kernel/alternative.c   |   4 +
 kernel/locking/mcs_spinlock.h   |   2 +-
 kernel/locking/qspinlock.c  |  42 ++-
 kernel/locking/qspinlock_cna.h  | 336 
 7 files changed, 413 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 526d65d..41ee8ff 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3423,6 +3423,16 @@
numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
Allowed values are enable and disable
 
+   numa_spinlock=  [NUMA, PV_OPS] Select the NUMA-aware variant
+   of spinlock. The options are:
+   auto - Enable this variant if running on a multi-node
+   machine in native environment.
+   on  - Unconditionally enable this variant.
+   off - Unconditionally disable this variant.
+
+   Not specifying this option is equivalent to
+   numa_spinlock=auto.
+
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b8..6a04cf61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1564,6 +1564,26 @@ config NUMA
 
  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+   bool "Numa-aware spinlocks"
+   depends on NUMA
+   depends on QUEUED_SPINLOCKS
+   depends on 64BIT
+   # For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+   # This is awkward, but hopefully would be resolved once static_call()
+   # is available.
+   depends on PARAVIRT_SPINLOCKS
+   default y
+   help
+ Introduce NUMA (Non Uniform Memory Access) awareness into
+ the slow path of spinlocks.
+
+ In this variant of qspinlock, the kernel will try to keep the lock
+ on the same node, thus reducing the number of remote cache misses,
+ while trading some of the short term fairness for better performance.
+
+ Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
def_bool y
prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d86ab94..21d09e8 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 
queued_fetch_set_pending_acquire(struct qspinlock *lo
return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2400ad6..e04f48c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -741,6 +741,10 @@ void __init alternative_instructions(void)
   

Re: [PATCH] powerpc: Drop -me200 addition to build flags

2020-11-17 Thread Michael Ellerman
On Mon, 16 Nov 2020 23:09:13 +1100, Michael Ellerman wrote:
> Currently a build with CONFIG_E200=y will fail with:
> 
>   Error: invalid switch -me200
>   Error: unrecognized option -me200
> 
> Upstream binutils has never supported an -me200 option. Presumably it
> was supported at some point by either a fork or Freescale internal
> binutils.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc: Drop -me200 addition to build flags
  https://git.kernel.org/powerpc/c/e02152ba2810f7c88cb54e71cda096268dfa9241

cheers


[PATCH v12 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

2020-11-17 Thread Alex Kogan
Keep track of the time the thread at the head of the secondary queue
has been waiting, and force inter-node handoff once this time passes
a preset threshold. The default value for the threshold (10ms) can be
overridden with the new kernel boot command-line option
"numa_spinlock_threshold". The ms value is translated internally to the
nearest rounded-up jiffies.

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |  9 +++
 kernel/locking/qspinlock_cna.h  | 95 +
 2 files changed, 92 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 41ee8ff..3bcf756 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3433,6 +3433,15 @@
Not specifying this option is equivalent to
numa_spinlock=auto.
 
+   numa_spinlock_threshold=[NUMA, PV_OPS]
+   Set the time threshold in milliseconds for the
+   number of intra-node lock hand-offs before the
+   NUMA-aware spinlock is forced to be passed to
+   a thread on another NUMA node.  Valid values
+   are in the [1..100] range. Smaller values result
+   in a more fair, but less performant spinlock,
+   and vice versa. The default value is 10.
+
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 590402a..d3e2754 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -37,6 +37,12 @@
  * gradually filter the primary queue, leaving only waiters running on the same
  * preferred NUMA node.
  *
+ * We change the NUMA node preference after a waiter at the head of the
+ * secondary queue spins for a certain amount of time (10ms, by default).
+ * We do that by flushing the secondary queue into the head of the primary 
queue,
+ * effectively changing the preference to the NUMA node of the waiter at the 
head
+ * of the secondary queue at the time of the flush.
+ *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
  * Authors: Alex Kogan 
@@ -49,13 +55,33 @@ struct cna_node {
u16 real_numa_node;
u32 encoded_tail;   /* self */
u32 partial_order;  /* enum val */
+   s32 start_time;
 };
 
 enum {
LOCAL_WAITER_FOUND,
LOCAL_WAITER_NOT_FOUND,
+   FLUSH_SECONDARY_QUEUE
 };
 
+/*
+ * Controls the threshold time in ms (default = 10) for intra-node lock
+ * hand-offs before the NUMA-aware variant of spinlock is forced to be
+ * passed to a thread on another NUMA node. The default setting can be
+ * changed with the "numa_spinlock_threshold" boot option.
+ */
+#define MSECS_TO_JIFFIES(m)\
+   (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
+static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+   s32 current_time = (s32)jiffies;
+   s32 threshold = cn->start_time + intra_node_handoff_threshold;
+
+   return current_time - threshold > 0;
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -98,6 +124,7 @@ static __always_inline void cna_init_node(struct 
mcs_spinlock *node)
struct cna_node *cn = (struct cna_node *)node;
 
cn->numa_node = cn->real_numa_node;
+   cn->start_time = 0;
 }
 
 /*
@@ -197,8 +224,15 @@ static void cna_splice_next(struct mcs_spinlock *node,
 
/* stick `next` on the secondary queue tail */
if (node->locked <= 1) { /* if secondary queue is empty */
+   struct cna_node *cn = (struct cna_node *)node;
+
/* create secondary queue */
next->next = next;
+
+   cn->start_time = (s32)jiffies;
+   /* make sure start_time != 0 iff secondary queue is not empty */
+   if (!cn->start_time)
+   cn->start_time = 1;
} else {
/* add to the tail of the secondary queue */
struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -249,11 +283,15 @@ static __always_inline u32 cna_wait_head_or_lock(struct 
qspinlock *lock,
 {
struct cna_node *cn = (struct cna_node *)node;
 
-   /*
-* Try and put the time otherwise spent spin waiting on
-* _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-*/
-   

[PATCH v12 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic

2020-11-17 Thread Alex Kogan
The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h   | 18 +-
 kernel/locking/qspinlock.c  |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h 
b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf..1eb4d73 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include 
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock) \
+#define arch_mcs_spin_wait(lock)   \
 do {   \
/* Ensure prior stores are observed before we enter wfe. */ \
smp_mb();   \
@@ -14,9 +14,9 @@ do {  
\
wfe();  \
 } while (0)\
 
-#define arch_mcs_spin_unlock_contended(lock)   \
+#define arch_mcs_lock_handoff(lock, val)   \
 do {   \
-   smp_store_release(lock, 1); \
+   smp_store_release((lock), (val));   \
dsb_sev();  \
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h 
b/include/asm-generic/mcs_spinlock.h
index 10cd4ff..f933d99 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_wait(l)
+ *   arch_mcs_lock_handoff(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153..904ba5d 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_wait
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)
\
-do {   \
-   smp_cond_load_acquire(l, VAL);  \
+#define arch_mcs_spin_wait(l)  \
+do {   \
+   smp_cond_load_acquire(l, VAL);  \
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_lock_handoff
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)  \
-   smp_store_release((l), 1)
+#define arch_mcs_lock_handoff(l, val)  \
+   smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct 
mcs_spinlock *node)
WRITE_ONCE(prev->next, node);
 
/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_wait(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct 
mcs_spinlock *node)
}
 
/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&node->locked);
+	arch_mcs_lock_handoff(&node->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba..435d696 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -471,7 +471,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
WRITE_ONCE(prev->next, node);
 
pv_wait_node(node, prev);
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_wait(&node->locked);
 
/*

Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 5:19 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 13:33:42 -0800
> Kees Cook  wrote:
> 
>> As I think got discussed in the thread, what you had here wouldn't work
>> in a CFI build if the function prototype of the call site and the
>> function don't match. (Though I can't tell if .func() is ever called?)
>> 
>> i.e. .func's prototype must match tp_stub_func()'s.
>> 
> 
> 
> Hmm, I wonder how you handle tracepoints? This is called here:
> 
> include/linux/tracepoint.h:
> 
> 
> #define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
>   static const char __tpstrtab_##_name[]  \
>   __section("__tracepoints_strings") = #_name;\
>   extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
>   int __traceiter_##_name(void *__data, proto);   \
>   struct tracepoint __tracepoint_##_name  __used  \
>   __section("__tracepoints") = {  \
>   .name = __tpstrtab_##_name, \
>   .key = STATIC_KEY_INIT_FALSE,   \
>   .static_call_key = &STATIC_CALL_KEY(tp_func_##_name),   \
>   .static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
>   .iterator = &__traceiter_##_name,   \
>   .regfunc = _reg,\
>   .unregfunc = _unreg,\
>   .funcs = NULL };\
>   __TRACEPOINT_ENTRY(_name);  \
>   int __traceiter_##_name(void *__data, proto)\
>   {   \
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
>   \
>   it_func_ptr =   \
>   rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
>   do {\
>   it_func = (it_func_ptr)->func;  \
>   __data = (it_func_ptr)->data;   \
> 
>   ((void(*)(void *, proto))(it_func))(__data, args); \
> 
>    called above 
> 
> Where args is unique for every tracepoint, but func is simply a void
> pointer.

That being said, the called functions have a prototype which match the
caller prototype exactly. So within the tracepoint internal data structures,
this function pointer is indeed a void pointer, but it is cast to a prototype
matching the callees to perform the calls. I suspect that as long as CFI checks
that caller/callees prototypes are compatible at runtime when the actual
calls happen, this all works fine.

Thanks,

Mathieu

> 
> -- Steve
> 
> 
>   } while ((++it_func_ptr)->func);\
>   return 0;   \
>   }   \

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 10/24] x86/resctrl: Move the schema names into struct resctrl_schema

2020-11-17 Thread Reinette Chatre

Hi James,

On 10/30/2020 9:11 AM, James Morse wrote:

Move the names used for the schemata file out of the resource and
into struct resctrl_schema. This allows one resource to have two
different names, based on the other schema properties.

This patch copies the names, eventually resctrl will generate them.


Please remove "This patch".



Remove the arch code's max_name_width, this is now resctrl's
problem.

Signed-off-by: James Morse 
---
  arch/x86/kernel/cpu/resctrl/core.c|  9 ++---
  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 10 +++---
  arch/x86/kernel/cpu/resctrl/internal.h|  2 +-
  arch/x86/kernel/cpu/resctrl/rdtgroup.c| 17 -
  include/linux/resctrl.h   |  7 +++
  5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c 
b/arch/x86/kernel/cpu/resctrl/core.c
index 1ed5e04031e6..cda071009fed 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -37,10 +37,10 @@ DEFINE_MUTEX(rdtgroup_mutex);
  DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state);
  
  /*

- * Used to store the max resource name width and max resource data width
+ * Used to store the max resource data width
   * to display the schemata in a tabular format
   */
-int max_name_width, max_data_width;
+int max_data_width;
  
  /*

   * Global boolean for rdt_alloc which is true if any
@@ -776,13 +776,8 @@ static int resctrl_offline_cpu(unsigned int cpu)
  static __init void rdt_init_padding(void)
  {
struct rdt_resource *r;
-   int cl;
  
  	for_each_alloc_capable_rdt_resource(r) {

-   cl = strlen(r->name);
-   if (cl > max_name_width)
-   max_name_width = cl;
-
if (r->data_width > max_data_width)
max_data_width = r->data_width;
}


The original code determines the maximum width based on resources 
supported by the platform.



diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c 
b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index a65ff53394ed..28d69c78c29e 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c


...


@@ -391,7 +389,7 @@ static void show_doms(struct seq_file *s, struct 
resctrl_schema *schema, int clo
bool sep = false;
u32 ctrl_val;
  
-	seq_printf(s, "%*s:", max_name_width, r->name);

+   seq_printf(s, "%*s:", RESCTRL_NAME_LEN, schema->name);


From what I understand this changes what some users will see. In the 
original code the "max_name_width" is computed based on the maximum 
length of resources supported. Systems that only support MBA would thus 
show a schemata of:


MB:0=100;1=100

I expect the above change would change the output to:
     MB:0=100;1=100



	list_for_each_entry(dom, &r->domains, list) {
hw_dom = resctrl_to_arch_dom(dom);
if (sep)


...



diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8a12f4128209..9f71f0238239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -15,6 +15,11 @@ int proc_resctrl_show(struct seq_file *m,
  
  #endif
  
+/*

+ * The longest name we expect in the schemata file:
+ */
+#define RESCTRL_NAME_LEN   7
+
  enum resctrl_conf_type {
CDP_BOTH,
CDP_CODE,
@@ -172,12 +177,14 @@ struct rdt_resource {
  
  /**

   * @list: Member of resctrl's schema list
+ * @names: Name to use in "schemata" file


s/names/name/?


   * @conf_type:Type of configuration, e.g. code/data/both
   * @res:  The rdt_resource for this entry
   * @num_closidNumber of CLOSIDs available for this resource
   */
  struct resctrl_schema {
struct list_headlist;
+   charname[RESCTRL_NAME_LEN];
enum resctrl_conf_type  conf_type;
struct rdt_resource *res;
u32 num_closid;




Reinette


Re: [PATCH 2/2] fpga: dfl: look for vendor specific capability

2020-11-17 Thread matthew . gerlach




On Tue, 17 Nov 2020, Tom Rix wrote:



On 11/16/20 5:25 PM, matthew.gerl...@linux.intel.com wrote:

From: Matthew Gerlach 

A DFL may not begin at offset 0 of BAR 0.  A PCIe vendor
specific capability can be used to specify the start of a
number of DFLs.

Signed-off-by: Matthew Gerlach 
---
 Documentation/fpga/dfl.rst | 10 +
 drivers/fpga/dfl-pci.c | 88 +-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 0404fe6ffc74..c81ceb1e79e2 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -501,6 +501,16 @@ Developer only needs to provide a sub feature driver with 
matched feature id.
 FME Partial Reconfiguration Sub Feature driver (see drivers/fpga/dfl-fme-pr.c)
 could be a reference.

+Location of DFLs on PCI bus
+===
+The start of the DFL is assumed to be offset 0 of bar 0.
+Alternatively, a vendor specific capability structure can be used to
+specify the location of one or more DFLs.  Intel has reserved the
+vendor specific id of 0x43 for this purpose.  The vendor specific
+data begins with a 4 byte count of the number of DFLs followed by 4 byte
+Offset/BIR fields for each DFL. Bits 2:0 of the Offset/BIR field indicate
+the BAR, and bits 31:3 form the 8 byte aligned offset where bits 2:0 are
+zero.



Does the 'Device Feature List (DFL) Overview' section need to change ?


The 'Device Feature List (DFL) Overview' section does not really mention
the starting location of the DFLs.  I think a section discussing
the starting location is enough.



Maybe some more ascii art on location of bar0 vs vendor specific ?


I've added some clarity in v2 which might be enough.




 Open discussion
 ===
diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
index b1b157b41942..5418e8bf2496 100644
--- a/drivers/fpga/dfl-pci.c
+++ b/drivers/fpga/dfl-pci.c
@@ -27,6 +27,13 @@
 #define DRV_VERSION"0.8"

Since basic pci functionality is changing, consider incrementing this version.

 #define DRV_NAME   "dfl-pci"

+#define PCI_VNDR_ID_DFLS 0x43
+
+#define PCI_VNDR_DFLS_CNT_OFFSET 8
+#define PCI_VNDR_DFLS_RES_OFFSET 0x0c
+
+#define PCI_VND_DFLS_RES_BAR_MASK 0x7

Is this missing an R? PCI_VNDR_DFLS_RES_BAR_MASK?


Good catch!.  Will fix in v2.


+
 struct cci_drvdata {
struct dfl_fpga_cdev *cdev; /* container device */
 };
@@ -119,6 +126,82 @@ static int *cci_pci_create_irq_table(struct pci_dev 
*pcidev, unsigned int nvec)
return table;
 }

+static int find_dfl_in_cfg(struct pci_dev *pcidev,
+  struct dfl_fpga_enum_info *info)
+{
+   u32 bar, offset, vndr_hdr, dfl_cnt, dfl_res;
+   int dfl_res_off, i, voff = 0;
+   resource_size_t start, len;
+
+   while ((voff = pci_find_next_ext_capability(pcidev, voff, 
PCI_EXT_CAP_ID_VNDR))) {
+

extra nl

Ok, fix in v2.


+		pci_read_config_dword(pcidev, voff + PCI_VNDR_HEADER, &vndr_hdr);


A general problem.

Return of pci_read is not checked, nor are the values ex/ vndr_hdr initialized.


In v2 the variables will be initialized to invalid values that will be 
caught with the existing checks.
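
For illustration only (not the actual v2 code), the idea is roughly:

	u32 vndr_hdr = 0;	/* not a valid header, rejected by the ID check */

	while ((voff = pci_find_next_ext_capability(pcidev, voff,
						    PCI_EXT_CAP_ID_VNDR))) {
		vndr_hdr = 0;
		pci_read_config_dword(pcidev, voff + PCI_VNDR_HEADER, &vndr_hdr);
		if (PCI_VNDR_HEADER_ID(vndr_hdr) == PCI_VNDR_ID_DFLS)
			break;
	}

so a failed config read leaves vndr_hdr at a value the existing check
already filters out.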





+
+		dev_dbg(&pcidev->dev,
+   "vendor-specific capability id 0x%x, rev 0x%x len 
0x%x\n",
+   PCI_VNDR_HEADER_ID(vndr_hdr),
+   PCI_VNDR_HEADER_REV(vndr_hdr),
+   PCI_VNDR_HEADER_LEN(vndr_hdr));
+
+   if (PCI_VNDR_HEADER_ID(vndr_hdr) == PCI_VNDR_ID_DFLS)
+   break;
+   }
+
+   if (!voff) {
+		dev_dbg(&pcidev->dev, "%s no VSEC found\n", __func__);
+   return -ENODEV;
+   }
+
+	pci_read_config_dword(pcidev, voff + PCI_VNDR_DFLS_CNT_OFFSET, &dfl_cnt);
+	dev_info(&pcidev->dev, "dfl_cnt %d\n", dfl_cnt);
+   for (i = 0; i < dfl_cnt; i++) {

Is there a upper limit on the dfl_cnt ? maybe PCI_STD_NUM_BARS ?


Technically, there could be more than one DFL in a bar.  I don't
really know what criteria constitutes an upper limit.


+   dfl_res_off = voff + PCI_VNDR_DFLS_RES_OFFSET +
+ (i * sizeof(dfl_res));
+		pci_read_config_dword(pcidev, dfl_res_off, &dfl_res);
+
+		dev_dbg(&pcidev->dev, "dfl_res 0x%x\n", dfl_res);
+
+   bar = dfl_res & PCI_VND_DFLS_RES_BAR_MASK;

an extra nl, fix the similar ones as well.

+
+   if (bar >= PCI_STD_NUM_BARS) {
+			dev_err(&pcidev->dev, "%s bad bar number %d\n",
+   __func__, bar);
+   return -EINVAL;
+   }
+
+   len = pci_resource_len(pcidev, bar);
+
+   if (len == 0) {
+			dev_err(&pcidev->dev, "%s unmapped bar number %d\n",
+   __func__, bar);
+   return -EINVAL;
+   }
+
+   offset = dfl_res & 

Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-17 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 5:10 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> Mathieu Desnoyers  wrote:

[...]

>> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
>> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
>> func pointer can be updated and loaded concurrently.
> 
> I thought about this a little, and then only thing we really should worry
> about is synchronizing with those that unregister. Because when we make
> this update, there are now two states. the __DO_TRACE either reads the
> original func or the stub. And either should be OK to call.
> 
> Only the func gets updated and not the data. So what exactly are we worried
> about here?

Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

However, if we want to compare the function pointer to some other value and
conditionally do (or skip) the call, I think you'll need the 
READ_ONCE/WRITE_ONCE
to make sure the pointer is not re-fetched between comparison and call.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 5:16 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 16:22:23 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
>> If we don't call the stub, then there is no point in having the stub at
>> all, and we should just compare to a constant value, e.g. 0x1UL. As far
>> as I can recall, comparing with a small immediate constant is more efficient
>> than comparing with a loaded value on many architectures.
> 
> Why 0x1UL, and not just set it to NULL.
> 
>   do {\
>   it_func = (it_func_ptr)->func;  \
>   __data = (it_func_ptr)->data;   \
>   if (likely(it_func))\
>   ((void(*)(void *, proto))(it_func))(__data, 
> args); \
>   } while ((++it_func_ptr)->func);

Because of this end-of-loop condition ^
which is also testing for a NULL func. So if we reach a stub, we end up stopping
iteration and not firing the following tracepoint probes.
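
In other words (sketch only), a removed-probe marker has to stay
non-NULL for the walk to continue past it, e.g.:

	do {
		it_func = (it_func_ptr)->func;
		__data = (it_func_ptr)->data;
		if (likely(it_func != (void *)0x1UL))	/* 0x1UL marks a removed probe */
			((void(*)(void *, proto))(it_func))(__data, args);
	} while ((++it_func_ptr)->func);		/* NULL still terminates the array */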

Thanks,

Mathieu

> 
> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] rtc: destroy mutex when releasing the device

2020-11-17 Thread Alexandre Belloni
On Tue, 10 Nov 2020 10:42:05 +0100, Bartosz Golaszewski wrote:
> Not destroying mutexes doesn't lead to resource leak but it's the correct
> thing to do for mutex debugging accounting.

Applied, thanks!

[1/1] rtc: destroy mutex when releasing the device
  commit: b3527837a60a5dcd0c16c28804b6ec9b47f15947

Best regards,
-- 
Alexandre Belloni 


Re: [RFC PATCH v3 9/9] ipu3-cio2: Add functionality allowing software_node connections to sensors on platforms designed for Windows

2020-11-17 Thread Dan Scally
On 17/11/2020 16:42, Andy Shevchenko wrote:
> On Tue, Nov 17, 2020 at 2:02 PM Dan Scally  wrote:
>> On 16/11/2020 16:16, Andy Shevchenko wrote:
>>> On Mon, Nov 16, 2020 at 02:15:01PM +, Dan Scally wrote:
 On 16/11/2020 14:10, Laurent Pinchart wrote:
> I thought we were looking for ACPI devices, not companion devices, in
> order to extract information from the DSDT and store it in a software
> node. I could very well be wrong though.
 This is correct - the code to fetch the various resources we're looking
 at all uses acpi_device. Whether using Andy's iterator suggestions or
previous bus_for_each_dev(&acpi_bus_type...) I'm just getting the
 acpi_device via to_acpi_dev() and using that.
>>> If you try to get an I²C ore SPI device out of pure ACPI device (with given
>>> APCI _HID) you will fail. So, it's not correct. You are retrieving companion
>>> devices, while they are still in the struct acpi_device.
>>>
>>> And don't ask me, why it's so. I wasn't designed that and didn't affect any
>>> decision made there.
>> Well, in terms of the actual device we're getting, I don't think we're
>> fundamentally doing anything different between the methods...unless I'm
>> really mistaken.
>>
>>
>> Originally implementation was like:
>>
>>
>> const char *supported_devices[] = {
>>
>> "OVTI2680",
>>
>> };
>>
>>
>> static int cio2_bridge_connect_supported_devices(void)
>>
>> {
>>
>> struct acpi_device *adev;
>>
>> int i;
>>
>> for (i = 0; i < ARRAY_SIZE(supported_devices); i++) {
>>
>> adev =
>> acpi_dev_get_first_match_dev(supported_devices[i], NULL, -1);
>>
>> ...
>>
>> }
>>
>>
>> and acpi_dev_get_first_match_dev() likewise just returns adev via
>> to_acpi_device(dev).
>>
>>
>> So, maybe we don't need to do the iterating over all devices with
>> matching _HID at all, in which case it can be dropped, but if we're
>> doing it then I can't see that it's different to the original
>> implementation in terms of the struct acpi_device we're working with or
>> the route taken to get it.
>>
>>
>> Either way; ACPI maintainers asked to be CC'd on the next patchset
>> anyway, so they'll see what we're doing and be able to weigh in.
> Implementation wise the two approaches are quite similar for now, indeed.
> I would rather go with an iterator approach for a simple reason, EFI
> code already has something which may utilize iterators rather than
> using their home grown solution.
Alright - let's stick with that approach and leave the handling multiple
sensors of same model in then. That's the current state of the code
anyway, and it means it can be used elsewhere too.


Re: Ping(3): [PATCH v4] <sys/param.h>: Add nitems()

2020-11-17 Thread Alejandro Colomar
Hi Joseph,

On 11/17/20 11:44 PM, Joseph Myers wrote:
> I've asked the WG14 reflector why N2529 (and N2530, though that one's not
> relevant to this feature) doesn't seem to have made it onto a meeting
> agenda yet, when there have been two WG14 meetings since that proposal
was
> made and a third one coming up.
>

Nice!
Please update me on any feedback you receive.

So glibc will basically hold this patch
at least until the WG answers to that, right?

Thanks,

Alex


[GIT PULL] Kunit fixes update for Linux 5.10-rc5

2020-11-17 Thread Shuah Khan

Hi Linus,

Please pull the following Kunit fixes update for Linux 5.10-rc5.

This Kunit update for Linux 5.10-rc5 consists of several fixes to the Kunit
documentation and tool, compile-time fixes to stop polluting the source
directory, and a fix to remove the tools/testing/kunit/.gitattributes file.

diff is attached.

Brendan fixed the weirdness with tools/testing/kunit/.gitattributes
file. Thanks for noticing it.

thanks,
-- Shuah


The following changes since commit 0d0d245104a42e593adcf11396017a6420c08ba8:

  kunit: tools: fix kunit_tool tests for parsing test plans (2020-10-26 
13:25:40 -0600)


are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest 
tags/linux-kselftest-kunit-fixes-5.10-rc5


for you to fetch changes up to 3084db0e0d5076cd48408274ab0911cd3ccdae88:

  kunit: fix display of failed expectations for strings (2020-11-10 
13:45:15 -0700)



linux-kselftest-kunit-fixes-5.10-rc5

This Kunit update for Linux 5.10-rc5 consists of several fixes to the Kunit
documentation and tool, compile-time fixes to stop polluting the source
directory, and a fix to remove the tools/testing/kunit/.gitattributes file.


Andy Shevchenko (2):
  kunit: Do not pollute source directory with generated files 
(.kunitconfig)
  kunit: Do not pollute source directory with generated files 
(test.log)


Brendan Higgins (1):
  kunit: tool: unmark test_data as binary blobs

Daniel Latypov (4):
  kunit: tool: fix pre-existing python type annotation errors
  kunit: tool: print out stderr from make (like build warnings)
  kunit: tool: fix extra trailing \n in raw + parsed test output
  kunit: fix display of failed expectations for strings

David Gow (1):
  kunit: Fix kunit.py parse subcommand (use null build_dir)

Randy Dunlap (3):
  KUnit: Docs: fix a wording typo
  KUnit: Docs: style: fix some Kconfig example issues
  KUnit: Docs: usage: wording fixes

 Documentation/dev-tools/kunit/faq.rst   |  2 +-
 Documentation/dev-tools/kunit/style.rst | 18 +--
 Documentation/dev-tools/kunit/usage.rst | 10 +++
 include/kunit/test.h|  2 +-
 tools/testing/kunit/.gitattributes  |  1 -
 tools/testing/kunit/kunit.py| 27 -
 tools/testing/kunit/kunit_kernel.py | 53 
+

 tools/testing/kunit/kunit_parser.py | 17 ++-
 tools/testing/kunit/kunit_tool_test.py  |  4 +--
 9 files changed, 80 insertions(+), 54 deletions(-)
 delete mode 100644 tools/testing/kunit/.gitattributes



diff --git a/Documentation/dev-tools/kunit/faq.rst b/Documentation/dev-tools/kunit/faq.rst
index 1628862e7024..8d5029ad210a 100644
--- a/Documentation/dev-tools/kunit/faq.rst
+++ b/Documentation/dev-tools/kunit/faq.rst
@@ -90,7 +90,7 @@ things to try.
re-run kunit_tool.
 5. Try to run ``make ARCH=um defconfig`` before running ``kunit.py run``. This
may help clean up any residual config items which could be causing problems.
-6. Finally, try running KUnit outside UML. KUnit and KUnit tests can run be
+6. Finally, try running KUnit outside UML. KUnit and KUnit tests can be
built into any kernel, or can be built as a module and loaded at runtime.
Doing so should allow you to determine if UML is causing the issue you're
seeing. When tests are built-in, they will execute when the kernel boots, and
diff --git a/Documentation/dev-tools/kunit/style.rst b/Documentation/dev-tools/kunit/style.rst
index da1d6f0ed6bc..8dbcdc552606 100644
--- a/Documentation/dev-tools/kunit/style.rst
+++ b/Documentation/dev-tools/kunit/style.rst
@@ -175,17 +175,17 @@ An example Kconfig entry:
 
 .. code-block:: none
 
-config FOO_KUNIT_TEST
-tristate "KUnit test for foo" if !KUNIT_ALL_TESTS
-depends on KUNIT
-default KUNIT_ALL_TESTS
-help
-This builds unit tests for foo.
+	config FOO_KUNIT_TEST
+		tristate "KUnit test for foo" if !KUNIT_ALL_TESTS
+		depends on KUNIT
+		default KUNIT_ALL_TESTS
+		help
+		  This builds unit tests for foo.
 
-For more information on KUnit and unit tests in general, please refer
-to the KUnit documentation in Documentation/dev-tools/kunit
+		  For more information on KUnit and unit tests in general, please refer
+		  to the KUnit documentation in Documentation/dev-tools/kunit/.
 
-If unsure, say N
+		  If unsure, say N.
 
 
 Test File and Module Names
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 62142a47488c..9c28c518e6a3 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -92,7 +92,7 @@ 

Re: [PATCH] selftests/seccomp: sh: Fix register names

2020-11-17 Thread John Paul Adrian Glaubitz
On 11/17/20 9:56 PM, Kees Cook wrote:
> It looks like the seccomp selftests were never actually built for sh.
> This fixes it, though I don't have an environment to do a runtime test
> of it yet.
> 
> Fixes: 0bb605c2c7f2b4b3 ("sh: Add SECCOMP_FILTER")
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 7f7ecfcd66db..26c72f2b61b1 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1804,8 +1804,8 @@ TEST_F(TRACE_poke, getpid_runs_normally)
>  #define SYSCALL_RET(_regs)   (_regs).a[(_regs).windowbase * 4 + 2]
>  #elif defined(__sh__)
>  # define ARCH_REGS   struct pt_regs
> -# define SYSCALL_NUM(_regs)  (_regs).gpr[3]
> -# define SYSCALL_RET(_regs)  (_regs).gpr[0]
> +# define SYSCALL_NUM(_regs)  (_regs).regs[3]
> +# define SYSCALL_RET(_regs)  (_regs).regs[0]
>  #else
>  # error "Do not know how to find your architecture's registers and syscalls"
>  #endif

Yes, this fix is indeed necessary. However, there is another build issue that I 
ran into
and I'm not sure why it happens, but commenting out "#include <linux/sched.h>" in
../clone3/clone3_selftests.h fixes it.

root@tirpitz:..selftests/seccomp> make
gcc -Wl,-no-as-needed -Wall  -lpthread  seccomp_bpf.c 
/usr/src/linux-5.9.8/tools/testing/selftests/kselftest_harness.h 
/usr/src/linux-5.9.8/tools/testing/selftests/kselftest.h  -o 
/usr/src/linux-5.9.8/tools/testing/selftests/seccomp/seccomp_bpf
In file included from seccomp_bpf.c:55:
../clone3/clone3_selftests.h:28:8: error: redefinition of ‘struct clone_args’
   28 | struct clone_args {
  |^~
In file included from ../clone3/clone3_selftests.h:8,
 from seccomp_bpf.c:55:
/usr/include/linux/sched.h:92:8: note: originally defined here
   92 | struct clone_args {
  |^~
make: *** [../lib.mk:140: 
/usr/src/linux-5.9.8/tools/testing/selftests/seccomp/seccomp_bpf] Error 1
root@tirpitz:..selftests/seccomp>

Your actual register naming fix is correct in any case as without your patch, 
building the seccomp
selftest fails with:

seccomp_bpf.c: In function ‘get_syscall’:
seccomp_bpf.c:1741:37: error: ‘struct pt_regs’ has no member named ‘gpr’; did 
you mean ‘pr’?
 1741 | # define SYSCALL_NUM(_regs) (_regs).gpr[3]
  | ^~~
seccomp_bpf.c:1794:9: note: in expansion of macro ‘SYSCALL_NUM’
 1794 |  return SYSCALL_NUM(regs);
  | ^~~
seccomp_bpf.c: In function ‘change_syscall’:
seccomp_bpf.c:1741:37: error: ‘struct pt_regs’ has no member named ‘gpr’; did 
you mean ‘pr’?
 1741 | # define SYSCALL_NUM(_regs) (_regs).gpr[3]
  | ^~~
seccomp_bpf.c:1817:3: note: in expansion of macro ‘SYSCALL_NUM’
 1817 |   SYSCALL_NUM(regs) = syscall;
  |   ^~~
seccomp_bpf.c:1742:37: error: ‘struct pt_regs’ has no member named ‘gpr’; did 
you mean ‘pr’?
 1742 | # define SYSCALL_RET(_regs) (_regs).gpr[0]
  | ^~~
seccomp_bpf.c:1859:3: note: in expansion of macro ‘SYSCALL_RET’
 1859 |   SYSCALL_RET(regs) = result;
  |   ^~~
seccomp_bpf.c: In function ‘get_syscall’:
seccomp_bpf.c:1795:1: warning: control reaches end of non-void function 
[-Wreturn-type]
 1795 | }
  | ^
make: *** [../lib.mk:140: 
/usr/src/linux-5.9.8/tools/testing/selftests/seccomp/seccomp_bpf] Error 1

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: [PATCH 0/8] rtc: rework resource management

2020-11-17 Thread Alexandre Belloni
On Mon, 9 Nov 2020 17:34:01 +0100, Bartosz Golaszewski wrote:
> As discussed: this is my take on RTC devres. The series does a couple things:
> it adds missing documentation of managed RTC functions, adds the 'devm_' 
> prefix
> to managed APIs, makes the rtc_device struct unaware of being managed (removes
> the registered field) and also shrinks devm_rtc_allocate_device().
> 
> Other than that, there are some semi-related patches in here, like the one
> using the managed variant of pinctrl_register() in rtc-omap and another one
> adding a generic error message when nvmem registraton fails.
> 
> [...]

Applied, thanks!

[1/8] rtc: omap: use devm_pinctrl_register()
  commit: a2d41cdac82332c52c13cb2ef225804eabd5a17c
[3/8] Documentation: list RTC devres helpers in devres.rst
  commit: 9700e2835743c98a9867711133c64cc7f57be477
[4/8] rtc: nvmem: remove nvram ABI
  commit: 003006f324d265b69afd18496bc06ee076c70d72
[5/8] rtc: add devm_ prefix to rtc_nvmem_register()
  commit: ae1907b39574c545e4c5f0e038e85d57f6358080
[6/8] rtc: nvmem: emit an error message when nvmem registration fails
  commit: ffb1ecf7f7cc3e0b6f3fc7f445cc405ccb52d048
[7/8] rtc: rework rtc_register_device() resource management
  commit: 9703f757249afc4c558c8712e953b5b33a73e379
[8/8] rtc: shrink devm_rtc_allocate_device()
  commit: 27f554c580c8ec9015aec3d998510cf462534e48

Best regards,
-- 
Alexandre Belloni 


Re: [PATCH] iommu/amd: Enforce 4k mapping for certain IOMMU data structures

2020-11-17 Thread Will Deacon
On Wed, Oct 28, 2020 at 11:18:24PM +, Suravee Suthikulpanit wrote:
> AMD IOMMU requires 4k-aligned pages for the event log, the PPR log,
> and the completion wait write-back regions. However, when allocating
> the pages, they could be part of large mapping (e.g. 2M) page.
> This causes #PF due to the SNP RMP hardware enforces the check based
> on the page level for these data structures.

Please could you include an example backtrace here?

> So, fix by calling set_memory_4k() on the allocated pages.

I think I'm missing something here. set_memory_4k() will break the kernel
linear mapping up into page granular mappings, but the IOMMU isn't using
that mapping, right? It's just using the physical address returned by
iommu_virt_to_phys(), so why does it matter?

Just be nice to capture some of this rationale in the log, especially as
I'm not familiar with this device.

> Fixes: commit c69d89aff393 ("iommu/amd: Use 4K page for completion wait 
> write-back semaphore")

I couldn't figure out how that commit could cause this problem. Please can
you explain that to me?

Cheers,

Will


Re: [PATCH RFC v2 0/5] dwmac-meson8b: picosecond precision RX delay support

2020-11-17 Thread Martin Blumenstingl
Hi Kevin,

On Sun, Nov 15, 2020 at 7:52 PM Martin Blumenstingl
 wrote:
[...]
> I have tested this on an X96 Air 4GB board (not upstream yet).
[...]
> Also I have tested this on a X96 Max board without any .dts changes

can you please add this series to your testing branch?
I am interested in feedback from Kernel CI for all the boards which
are there as well as any other testing bots


Thank you!
Best regards,
Martin


Re: [PATCH 08/24] x86/resctrl: Walk the resctrl schema list instead of an arch list

2020-11-17 Thread Reinette Chatre

Hi James,

On 10/30/2020 9:11 AM, James Morse wrote:

Now that resctrl has its own list of resources it is using, walk that
list instead of the architectures list. This means resctrl has somewhere
to keep schema properties with the resource that is using them.

Most users of for_each_alloc_enabled_rdt_resource() are per-schema,
and also want a schema property, like the conf_type. Switch these to
walk the schema list. Schema were only created for alloc_enabled
resources so these two lists are currently equivalent.



From what I understand based on this description the patch will 
essentially change instances of for_each_alloc_enabled_rdt_resource() to 
walking the schema list 



Signed-off-by: James Morse 
---
  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 38 ++-
  arch/x86/kernel/cpu/resctrl/internal.h|  6 ++--
  arch/x86/kernel/cpu/resctrl/rdtgroup.c| 34 +---
  include/linux/resctrl.h   |  5 +--
  4 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c 
b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8ac104c634fe..d3f9d142f58a 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -57,9 +57,10 @@ static bool bw_validate(char *buf, unsigned long *data, 
struct rdt_resource *r)
return true;
  }
  
-int parse_bw(struct rdt_parse_data *data, struct rdt_resource *r,

+int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 struct rdt_domain *d)
  {
+   struct rdt_resource *r = s->res;
unsigned long bw_val;
  
  	if (d->have_new_ctrl) {


... this change and also the ones to parse_cbm() and 
rdtgroup_cbm_overlaps() are not clear to me because it seems they 
replace the rdt_resource parameter with resctrl_schema, but all in turn 
use that to access rdt_resource again. That seems unnecessary?



@@ -125,10 +126,11 @@ static bool cbm_validate(char *buf, u32 *data, struct 
rdt_resource *r)
   * Read one cache bit mask (hex). Check that it is valid for the current
   * resource type.
   */
-int parse_cbm(struct rdt_parse_data *data, struct rdt_resource *r,
+int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
  struct rdt_domain *d)
  {
struct rdtgroup *rdtgrp = data->rdtgrp;
+   struct rdt_resource *r = s->res;
u32 cbm_val;
  
  	if (d->have_new_ctrl) {


Really needed?


@@ -160,12 +162,12 @@ int parse_cbm(struct rdt_parse_data *data, struct 
rdt_resource *r,
 * The CBM may not overlap with the CBM of another closid if
 * either is exclusive.
 */
-   if (rdtgroup_cbm_overlaps(r, d, cbm_val, rdtgrp->closid, true)) {
+   if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) {
rdt_last_cmd_puts("Overlaps with exclusive group\n");
return -EINVAL;
}
  
-	if (rdtgroup_cbm_overlaps(r, d, cbm_val, rdtgrp->closid, false)) {

+   if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) {
if (rdtgrp->mode == RDT_MODE_EXCLUSIVE ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
rdt_last_cmd_puts("Overlaps with other group\n");


Needed?


@@ -185,9 +187,10 @@ int parse_cbm(struct rdt_parse_data *data, struct 
rdt_resource *r,
   * separated by ";". The "id" is in decimal, and must match one of
   * the "id"s for this resource.
   */
-static int parse_line(char *line, struct rdt_resource *r,
+static int parse_line(char *line, struct resctrl_schema *s,
  struct rdtgroup *rdtgrp)
  {
+   struct rdt_resource *r = s->res;
struct rdt_parse_data data;
char *dom = NULL, *id;
struct rdt_domain *d;
@@ -213,7 +216,8 @@ static int parse_line(char *line, struct rdt_resource *r,
if (d->id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
-			if (r->parse_ctrlval(&data, r, d))
+
+			if (r->parse_ctrlval(&data, s, d))
return -EINVAL;
if (rdtgrp->mode ==  RDT_MODE_PSEUDO_LOCKSETUP) {



needed?

/*

@@ -289,10 +293,12 @@ static int rdtgroup_parse_resource(char *resname, char 
*tok,
struct resctrl_schema *s;
struct rdt_resource *r;
  
+	lockdep_assert_held(&rdtgroup_mutex);

+


It is not clear how this addition fits into patch.


	list_for_each_entry(s, &resctrl_all_schema, list) {
r = s->res;
if (!strcmp(resname, r->name) && rdtgrp->closid < s->num_closid)
-   return parse_line(tok, r, rdtgrp);
+   return parse_line(tok, s, rdtgrp);
}



needed? (similar comments to other changes in this patch but I will stop 
here)



rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", 
resname);
return 

Re: [Freedreno] [PATCH] drm/msm/dpu: Remove chatty vbif debug print

2020-11-17 Thread Stephen Boyd
Quoting abhin...@codeaurora.org (2020-11-17 12:34:56)
> On 2020-11-17 09:26, Stephen Boyd wrote:
> > I don't know what this debug print is for but it is super chatty,
> > throwing 8 lines of debug prints in the logs every time we update a
> > plane. It looks like it has no value. Let's nuke it so we can get
> > better logs.
> > 
> > Cc: Sean Paul 
> > Cc: Abhinav Kumar 
> > Signed-off-by: Stephen Boyd 
> 
> > ---
> >  drivers/gpu/drm/msm/disp/dpu1/dpu_vbif.c | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_vbif.c
> > b/drivers/gpu/drm/msm/disp/dpu1/dpu_vbif.c
> > index 5e8c3f3e6625..5eb2b2ee09f5 100644
> > --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_vbif.c
> > +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_vbif.c
> > @@ -245,9 +245,6 @@ void dpu_vbif_set_qos_remap(struct dpu_kms 
> > *dpu_kms,
> >   forced_on = mdp->ops.setup_clk_force_ctrl(mdp, params->clk_ctrl, 
> > true);
> > 
> >   for (i = 0; i < qos_tbl->npriority_lvl; i++) {
> > - DPU_DEBUG("vbif:%d xin:%d lvl:%d/%d\n",
> > - params->vbif_idx, params->xin_id, i,
> > - qos_tbl->priority_lvl[i]);
> 
> Instead of getting rid of this print, we should optimize the caller of 
> this.

Does the print tell us anything? Right now it prints 8 lines where it
feels like it could be trimmed down:

   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:0/3
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:1/3
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:2/4
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:3/4
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:4/5
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:5/5
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:6/6
   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 lvl:7/6

maybe one line that combines the index into values?

   [drm:dpu_vbif_set_qos_remap] vbif:0 xin:0 [3 3 4 4 5 5 6 6]

But again I have no idea if this print is really useful. Maybe we can
print it only if the value changes from what was already there?
Basically move the print into dpu_hw_set_qos_remap() and then skip out
early if nothing changed or print and modify the register.
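
Something along these lines (rough sketch; buffer size and format are
arbitrary) would collapse it to a single line per call:

	char lvls[64];
	int len = 0;

	for (i = 0; i < qos_tbl->npriority_lvl; i++)
		len += scnprintf(lvls + len, sizeof(lvls) - len, "%s%d",
				 i ? " " : "", qos_tbl->priority_lvl[i]);

	DPU_DEBUG("vbif:%d xin:%d [%s]\n", params->vbif_idx, params->xin_id, lvls);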

> This is what
> we are doing in downstream. So we need to update the property only if we 
> are switching from a RT client
> to non-RT client for the plane and vice-versa. So we should try to do 
> the same thing here.
> 
> is_rt = sde_crtc_is_rt_client(crtc, crtc->state);
> if (is_rt != psde->is_rt_pipe) {
> psde->is_rt_pipe = is_rt;
> pstate->dirty |= SDE_PLANE_DIRTY_QOS;
> }
> 
> 
> if (pstate->dirty & DPU_PLANE_DIRTY_QOS)
> _dpu_plane_set_qos_remap(plane);
> 

Sounds great! Can you send the patch?


Re: [PATCH] phy: amlogic: Replace devm_reset_control_array_get()

2020-11-17 Thread Martin Blumenstingl
Hi Yejune,

On Tue, Nov 17, 2020 at 6:58 AM Yejune Deng  wrote:
>
> devm_reset_control_array_get_exclusive() looks more readable
>
> Signed-off-by: Yejune Deng 
> ---
>  drivers/phy/amlogic/phy-meson-axg-pcie.c   | 2 +-
>  drivers/phy/amlogic/phy-meson-g12a-usb3-pcie.c | 2 +-
>  drivers/soc/amlogic/meson-ee-pwrc.c| 3 +--
>  drivers/soc/amlogic/meson-gx-pwrc-vpu.c| 2 +-
what's the reason behind including PHY and SoC driver changes in one patch?
I think both main directories will be applied by different
maintainers, so I believe the patch should be split into:
- one drivers/phy patch
- another drivers/soc patch


Best regards,
Martin


Re: Ping(3): [PATCH v4] <sys/param.h>: Add nitems()

2020-11-17 Thread Joseph Myers
I've asked the WG14 reflector why N2529 (and N2530, though that one's not 
relevant to this feature) doesn't seem to have made it onto a meeting 
agenda yet, when there have been two WG14 meetings since that proposal was 
made and a third one coming up.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH RFC v2 3/5] net: stmmac: dwmac-meson8b: use picoseconds for the RGMII RX delay

2020-11-17 Thread Martin Blumenstingl
Hi Florian,

On Tue, Nov 17, 2020 at 7:36 PM Florian Fainelli  wrote:
>
> On 11/15/20 10:52 AM, Martin Blumenstingl wrote:
> > Amlogic Meson G12A, G12B and SM1 SoCs have a more advanced RGMII RX
> > delay register which allows picoseconds precision. Parse the new
> > "amlogic,rgmii-rx-delay-ps" property or fall back to the old
> > "amlogic,rx-delay-ns".
> >
> > Signed-off-by: Martin Blumenstingl 
>
> Reviewed-by: Florian Fainelli 
first of all: thanks for reviewing this (and the rest of the series)!

> Maybe also issue a warning when the 'amlogic,rx-delay-ns' property is
> found in addition to the 'amlogic,rgmii-rx-delay-ps'? Up to you how to
> manage existing DTBs being deployed.
none of the upstream DTBs uses amlogic,rx-delay-ns - and I am also not
aware of anything being in use "downstream".
I will add a sentence to the commit description when I re-send this
without RFC, something along those lines: "No upstream DTB uses the
old amlogic,rx-delay-ns (yet). Only include minimalistic logic to fall
back to the old property, without any special validation (for example:
old and new property are given at the same time)"

What do you think?


Best regards,
Martin


Re: [PATCH bpf-next v3 2/2] bpf: Add tests for bpf_lsm_set_bprm_opts

2020-11-17 Thread Daniel Borkmann

On 11/17/20 3:13 AM, KP Singh wrote:
[...]

+
+static int run_set_secureexec(int map_fd, int secureexec)
+{
+


^ same here


+   int child_pid, child_status, ret, null_fd;
+
+   child_pid = fork();
+   if (child_pid == 0) {
+   null_fd = open("/dev/null", O_WRONLY);
+   if (null_fd == -1)
+   exit(errno);
+   dup2(null_fd, STDOUT_FILENO);
+   dup2(null_fd, STDERR_FILENO);
+   close(null_fd);
+
+   /* Ensure that all executions from hereon are
+* secure by setting a local storage which is read by
+* the bprm_creds_for_exec hook and sets bprm->secureexec.
+*/
+   ret = update_storage(map_fd, secureexec);
+   if (ret)
+   exit(ret);
+
+   /* If the binary is executed with secureexec=1, the dynamic
+* loader ignores and unsets certain variables like LD_PRELOAD,
+* TMPDIR etc. TMPDIR is used here to simplify the example, as
+* LD_PRELOAD requires a real .so file.
+*
+* If the value of TMPDIR is set, the bash command returns 10
+* and if the value is unset, it returns 20.
+*/
+   execle("/bin/bash", "bash", "-c",
+  "[[ -z \"${TMPDIR}\" ]] || exit 10 && exit 20", NULL,
+  bash_envp);
+   exit(errno);
+   } else if (child_pid > 0) {
+   waitpid(child_pid, &child_status, 0);
+   ret = WEXITSTATUS(child_status);
+
+   /* If a secureexec occurred, the exit status should be 20.
+*/
+   if (secureexec && ret == 20)
+   return 0;
+
+   /* If normal execution happened the exit code should be 10.
+*/
+   if (!secureexec && ret == 10)
+   return 0;
+


and here (rest looks good to me)


+   }
+
+   return -EINVAL;
+}
+


Re: [PATCH v11 07/16] PCI/ERR: Simplify by computing pci_pcie_type() once

2020-11-17 Thread Kelley, Sean V
Hi Sathya,

> On Nov 17, 2020, at 1:58 PM, Kuppuswamy, Sathyanarayanan 
>  wrote:
> 
> Hi,
> 
> On 11/17/20 11:19 AM, Sean V Kelley wrote:
>> Instead of calling pci_pcie_type(dev) twice, call it once and save the
>> result.  No functional change intended.
> 
> Same optimization can be applied to drivers/pci/pcie/portdrv_pci.c and
> drivers/pci/pcie/aer.c.
> 
> Can you fix them together ?

Makes sense.  I can combine the changes.

Thanks,

Sean

> 
>> [bhelgaas: split to separate patch]
>> Link: 
>> https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org
>> Signed-off-by: Sean V Kelley 
>> Signed-off-by: Bjorn Helgaas 
>> Acked-by: Jonathan Cameron 
>> ---
>>  drivers/pci/pcie/err.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 05f61da5ed9d..7a5af873d8bc 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -150,6 +150,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  pci_channel_state_t state,
>>  pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>  {
>> +int type = pci_pcie_type(dev);
>>  pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>  struct pci_bus *bus;
>>  @@ -157,8 +158,8 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   * Error recovery runs on all subordinates of the first downstream port.
>>   * If the downstream port detected the error, it is cleared at the end.
>>   */
>> -if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
>> -  pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
>> +if (!(type == PCI_EXP_TYPE_ROOT_PORT ||
>> +  type == PCI_EXP_TYPE_DOWNSTREAM))
>>  dev = pci_upstream_bridge(dev);
>>  bus = dev->subordinate;
>>  
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer



Re: [PATCH v3 00/14] iommu/amd: Add Generic IO Page Table Framework Support

2020-11-17 Thread Will Deacon
Hey Suravee (it's been a while!),

On Fri, Nov 13, 2020 at 12:57:18PM +0700, Suravee Suthikulpanit wrote:
> Please ignore to include the V3. I am working on V4 to resubmit.

Please can you put me on CC for that?

Thanks,

Will


Re: [PATCH] iommu: fix return error code in iommu_probe_device()

2020-11-17 Thread Will Deacon
On Tue, Nov 17, 2020 at 07:11:28PM +0800, Yang Yingliang wrote:
> On 2020/11/17 17:40, Lu Baolu wrote:
> > On 2020/11/17 10:52, Yang Yingliang wrote:
> > > If iommu_group_get() failed, it need return error code
> > > in iommu_probe_device().
> > > 
> > > Fixes: cf193888bfbd ("iommu: Move new probe_device path...")
> > > Reported-by: Hulk Robot 
> > > Signed-off-by: Yang Yingliang 
> > > ---
> > >   drivers/iommu/iommu.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index b53446bb8c6b..6f4a32df90f6 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -253,8 +253,10 @@ int iommu_probe_device(struct device *dev)
> > >   goto err_out;
> > >     group = iommu_group_get(dev);
> > > -    if (!group)
> > > +    if (!group) {
> > > +    ret = -ENODEV;
> > 
> > Can you please explain why you use -ENODEV here?
> 
> Before 79659190ee97 ("iommu: Don't take group reference in
> iommu_alloc_default_domain()"), in
> 
> iommu_alloc_default_domain(), if group is NULL, it will return -ENODEV.

Hmm. While I think the patch is ok, I'm not sure it qualifies as a fix.
Has iommu_probe_device() ever propagated this error? The commit you
identify in the 'Fixes:' tag doesn't seem to change this afaict.

Will


Re: [PATCH bpf-next v3 1/2] bpf: Add bpf_lsm_set_bprm_opts helper

2020-11-17 Thread Daniel Borkmann

On 11/17/20 3:13 AM, KP Singh wrote:

From: KP Singh 

The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_LSM_F_BPRM_SECUREEXEC flag.

secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc.  When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).

Signed-off-by: KP Singh 
---
  include/uapi/linux/bpf.h   | 18 ++
  kernel/bpf/bpf_lsm.c   | 27 +++
  scripts/bpf_helpers_doc.py |  2 ++
  tools/include/uapi/linux/bpf.h | 18 ++
  4 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 162999b12790..bfa79054d106 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3787,6 +3787,18 @@ union bpf_attr {
   **ARG_PTR_TO_BTF_ID* of type *task_struct*.
   *Return
   *Pointer to the current task.
+ *
+ * long bpf_lsm_set_bprm_opts(struct linux_binprm *bprm, u64 flags)
+ *


small nit: should have no extra newline (same for the tools/ copy)


+ * Description
+ * Set or clear certain options on *bprm*:
+ *
+ * **BPF_LSM_F_BPRM_SECUREEXEC** Set the secureexec bit
+ * which sets the **AT_SECURE** auxv for glibc. The bit
+ * is cleared if the flag is not specified.
+ * Return
+ * **-EINVAL** if invalid *flags* are passed.
+ *
   */
  #define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3948,6 +3960,7 @@ union bpf_attr {
FN(task_storage_get),   \
FN(task_storage_delete),\
FN(get_current_task_btf),   \
+   FN(lsm_set_bprm_opts),  \
/* */
  
  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

@@ -4119,6 +4132,11 @@ enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_IP,
  };
  
+/* Flags for LSM helpers */

+enum {
+   BPF_LSM_F_BPRM_SECUREEXEC   = (1ULL << 0),
+};
+
  #define __bpf_md_ptr(type, name)  \
  union {   \
type name;  \
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 553107f4706a..cd85482228a0 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -7,6 +7,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -51,6 +52,30 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
return 0;
  }
  
+/* Mask for all the currently supported BPRM option flags */

+#define BPF_LSM_F_BRPM_OPTS_MASK   BPF_LSM_F_BPRM_SECUREEXEC
+
+BPF_CALL_2(bpf_lsm_set_bprm_opts, struct linux_binprm *, bprm, u64, flags)
+{
+


ditto

Would have fixed up these things on the fly while applying, but one small item
I wanted to bring up here, given this is uapi which will then freeze: it would
be cleaner to call the helper just bpf_bprm_opts_set() or so, given it's
implied that we attach to lsm here and we don't use _lsm in the naming for the
other helpers either. Similarly, I'd drop the _LSM from the flag/mask.


+   if (flags & ~BPF_LSM_F_BRPM_OPTS_MASK)
+   return -EINVAL;
+
+   bprm->secureexec = (flags & BPF_LSM_F_BPRM_SECUREEXEC);
+   return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_lsm_set_bprm_opts_btf_ids, struct, linux_binprm)
+
+const static struct bpf_func_proto bpf_lsm_set_bprm_opts_proto = {
+   .func   = bpf_lsm_set_bprm_opts,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_BTF_ID,
+   .arg1_btf_id = &bpf_lsm_set_bprm_opts_btf_ids[0],
+   .arg2_type  = ARG_ANYTHING,
+};
+
  static const struct bpf_func_proto *
  bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
@@ -71,6 +96,8 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
return &bpf_task_storage_get_proto;
case BPF_FUNC_task_storage_delete:
return &bpf_task_storage_delete_proto;
+   case BPF_FUNC_lsm_set_bprm_opts:
+   return &bpf_lsm_set_bprm_opts_proto;
default:
return tracing_prog_func_proto(func_id, prog);
}
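
For reference, a minimal BPF LSM program using the helper as proposed in this
v3 could look like the sketch below. This is an illustration only, assuming the
series is applied and the helper definitions regenerated; the hook choice and
the unconditional set are assumptions, and the names would become
bpf_bprm_opts_set()/BPF_F_BPRM_SECUREEXEC if the rename suggested above is
adopted.

// SPDX-License-Identifier: GPL-2.0
/* Sketch: mark every execution as a secure exec so that glibc sees
 * AT_SECURE and ignores LD_PRELOAD and friends. A real policy would
 * first check the task or the binary being executed.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bprm_creds_for_exec")
int BPF_PROG(secure_all_execs, struct linux_binprm *bprm)
{
	bpf_lsm_set_bprm_opts(bprm, BPF_LSM_F_BPRM_SECUREEXEC);
	return 0;
}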


Re: [PATCH v11 02/16] PCI/RCEC: Add RCEC class code and extended capability

2020-11-17 Thread Kelley, Sean V
Hi Sathya,

Thanks for reviewing

> On Nov 17, 2020, at 12:07 PM, Kuppuswamy, Sathyanarayanan 
>  wrote:
> 
> Hi,
> 
> On 11/17/20 11:19 AM, Sean V Kelley wrote:
>> From: Qiuxu Zhuo 
>> A PCIe Root Complex Event Collector (RCEC) has base class 0x08, sub-class
>> 0x07, and programming interface 0x00.  Add the class code 0x0807 to
>> identify RCEC devices and add #defines for the RCEC Endpoint Association
>> Extended Capability.
>> See PCIe r5.0, sec 1.3.4 ("Root Complex Event Collector") and sec 7.9.10
>> ("Root Complex Event Collector Endpoint Association Extended Capability").
> Why not merge this change with the usage patch? Keeping the changes
> together will help in case the code needs to be reverted.

These are spec-derived values that have been absent until now. They could be
combined with the usage patch.

Sean



>> Link: 
>> https://lore.kernel.org/r/20201002184735.1229220-2-seanvk@oregontracks.org
>> Signed-off-by: Qiuxu Zhuo 
>> Signed-off-by: Bjorn Helgaas 
>> Reviewed-by: Jonathan Cameron 
>> ---
>>  include/linux/pci_ids.h   | 1 +
>>  include/uapi/linux/pci_regs.h | 7 +++
>>  2 files changed, 8 insertions(+)
>> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
>> index 1ab1e24bcbce..d8156a5dbee8 100644
>> --- a/include/linux/pci_ids.h
>> +++ b/include/linux/pci_ids.h
>> @@ -81,6 +81,7 @@
>>  #define PCI_CLASS_SYSTEM_RTC0x0803
>>  #define PCI_CLASS_SYSTEM_PCI_HOTPLUG0x0804
>>  #define PCI_CLASS_SYSTEM_SDHCI  0x0805
>> +#define PCI_CLASS_SYSTEM_RCEC   0x0807
>>  #define PCI_CLASS_SYSTEM_OTHER  0x0880
>>#define PCI_BASE_CLASS_INPUT  0x09
>> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
>> index a95d55f9f257..bccd3e35cb65 100644
>> --- a/include/uapi/linux/pci_regs.h
>> +++ b/include/uapi/linux/pci_regs.h
>> @@ -831,6 +831,13 @@
>>  #define  PCI_PWR_CAP_BUDGET(x)  ((x) & 1)   /* Included in system budget */
>>  #define PCI_EXT_CAP_PWR_SIZEOF  16
>>  +/* Root Complex Event Collector Endpoint Association  */
>> +#define PCI_RCEC_RCIEP_BITMAP   4   /* Associated Bitmap for RCiEPs */
>> +#define PCI_RCEC_BUSN   8   /* RCEC Associated Bus Numbers */
>> +#define  PCI_RCEC_BUSN_REG_VER  0x02   /* Least version with BUSN present */
>> +#define  PCI_RCEC_BUSN_NEXT(x)  (((x) >> 8) & 0xff)
>> +#define  PCI_RCEC_BUSN_LAST(x)  (((x) >> 16) & 0xff)
>> +
>>  /* Vendor-Specific (VSEC, PCI_EXT_CAP_ID_VNDR) */
>>  #define PCI_VNDR_HEADER 4   /* Vendor-Specific Header */
>>  #define  PCI_VNDR_HEADER_ID(x)  ((x) & 0xffff)
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
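
As a usage illustration for the new defines, an RCEC-aware driver could parse
the Endpoint Association capability roughly as sketched below. This is not the
code from later patches in this series; the function name and the pci_info()
message are hypothetical, and error handling is omitted.

#include <linux/pci.h>

/* Illustrative sketch: report the bus range associated with an RCEC. */
static void rcec_ea_show(struct pci_dev *dev)
{
	u32 hdr, busn;
	u16 ea = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_RCEC);

	if (!ea)
		return;

	pci_read_config_dword(dev, ea, &hdr);
	/* The associated bus number register only exists from cap version 2 on */
	if (PCI_EXT_CAP_VER(hdr) < PCI_RCEC_BUSN_REG_VER)
		return;

	pci_read_config_dword(dev, ea + PCI_RCEC_BUSN, &busn);
	pci_info(dev, "RCEC covers buses %02x-%02x\n",
		 PCI_RCEC_BUSN_NEXT(busn), PCI_RCEC_BUSN_LAST(busn));
}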



Re: [PATCH] spi: dw: Set transfer handler before unmasking the IRQs

2020-11-17 Thread Mark Brown
On Tue, 17 Nov 2020 12:40:54 +0300, Serge Semin wrote:
> It turns out the IRQs most likely can be unmasked before the controller is
> enabled with no problematic consequences. The manual doesn't explicitly
> state that, but the examples perform the controller initialization
> procedure in that order. So the commit da8f58909e7e ("spi: dw: Unmask IRQs
> after enabling the chip") wasn't as necessary as I thought. Still, setting
> the IRQs up after enabling the chip is worth keeping since it has
> simplified the code a bit. The problem is that it has introduced a
> potential bug: the transfer handler pointer is now initialized after the
> IRQs are enabled, which may and eventually will cause an invalid or
> uninitialized callback invocation. Fix that by performing the callback
> initialization before the IRQ unmask procedure.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: dw: Set transfer handler before unmasking the IRQs
  commit: a41b0ad07bfa081584218431cb0cd7e7ecc71210

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
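
The ordering the quoted commit message describes amounts to roughly the
sketch below. The symbol names are illustrative rather than the exact driver
identifiers; only the relative order matters.

	/* The transfer handler must be valid before any interrupt can fire,
	 * otherwise the ISR may invoke an uninitialized callback.
	 */
	dws->transfer_handler = dw_spi_transfer_handler;  /* 1. set the callback */
	spi_enable_chip(dws, 1);                          /* 2. enable the chip  */
	spi_umask_intr(dws, irq_mask);                    /* 3. unmask the IRQs  */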


Re: [PATCH] ASoC: intel: SND_SOC_INTEL_KEEMBAY should depend on ARCH_KEEMBAY

2020-11-17 Thread Mark Brown
On Tue, 10 Nov 2020 15:50:01 +0100, Geert Uytterhoeven wrote:
> The Intel Keem Bay audio module is only present on Intel Keem Bay SoCs.
> Hence add a dependency on ARCH_KEEMBAY, to prevent asking the user about
> this driver when configuring a kernel without Intel Keem Bay platform
> support.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: intel: SND_SOC_INTEL_KEEMBAY should depend on ARCH_KEEMBAY
  commit: 9a207228bdf0a4933b794c944d7111564353ea94

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [PATCH] ASoc: adi: Kconfig: Remove depends on for ADI reference designs

2020-11-17 Thread Mark Brown
On Tue, 10 Nov 2020 17:22:13 +0200, Alexandru Ardelean wrote:
> ADI audio reference designs are also used on some ZynqMP boards, and can
> likewise be used on Intel FPGA boards and on more complex FPGA
> combinations (FPGA cards connected through PCIe).
> 
> This change removes the dependency on the Microblaze and Zynq
> architectures to allow this driver to be used on the systems described
> above.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoc: adi: Kconfig: Remove depends on for ADI reference designs
  commit: e1ade4c55ae3559b082faf9f5207cc6caba1c546

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [PATCH] ASoC: Fix 7/8 spaces indentation in Kconfig

2020-11-17 Thread Mark Brown
On Tue, 10 Nov 2020 18:49:04 +0100, Geert Uytterhoeven wrote:
> Some entries used 7 or 8 spaces instead of a single TAB.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: Fix 7/8 spaces indentation in Kconfig
  commit: 5268e0bf7123c422892fec362f5be2bcae9bbb95

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

