Re: [PATCH 5/9] sched: prctl() core-scheduling interface

2021-04-17 Thread Joel Fernandes
On Wed, Apr 07, 2021 at 07:00:33PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 01, 2021 at 03:10:17PM +0200, Peter Zijlstra wrote:
> 
> > Current hard-coded policies are:
> > 
> >  - a user can clear the cookie of any process they can set a cookie for.
> >Lack of a cookie *might* be a security issue if cookies are being used
> >for that.
> 
> ChromeOS people, what are you doing about this? syscall/prctl filtering?

Yes, in ChromeOS we allow the prctl(2) syscall only before entering the
seccomp sandbox. Once we enter the sandbox, prctl(2) is not allowed at all.

This has the nice property that security is enforced on entering the
sandbox, and prior to entering the sandbox no special permissions need to be
granted.
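
For reference, a minimal sketch of that ordering using libseccomp (purely
illustrative, not the actual ChromeOS/minijail policy; link with -lseccomp.
The PR_SCHED_CORE_* values below are from this patch series and may differ
between revisions, so take them from the matching uapi header):

#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <seccomp.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE	60	/* assumed value; check uapi/linux/prctl.h */
#define PR_SCHED_CORE_CREATE	1
#endif

int main(void)
{
	/* 1) Tag ourselves (new core-sched cookie) before sandboxing. */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0, 1 /* PIDTYPE_TGID */, 0))
		perror("prctl(PR_SCHED_CORE_SHARE)");

	/* 2) Enter the sandbox: allow everything except prctl(2). */
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (!ctx)
		return 1;
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(prctl), 0);
	if (seccomp_load(ctx))
		return 1;
	seccomp_release(ctx);

	/* From here on, any further prctl(2) fails with EPERM. */
	return 0;
}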

Let me know if that makes sense and if you have any other questions. Thanks,

-Joel


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-17 Thread Joel Fernandes
On Tue, Apr 06, 2021 at 10:16:12AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> > Yeah, it's at http://lore.kernel.org/r/20200822030155.ga414...@google.com
> > as mentioned above; let me know if you need any more details about the
> > usecase.
> 
> Except for the unspecified reason in usecase 4, I don't see why cgroup is in
> the picture at all. This doesn't really have much to do with hierarchical
> resource distribution. Besides, yes, you can use cgroup for logical
> structuring and identification purposes but in those cases the interactions
> and interface should be with the original subsystem while using cgroup IDs
> or paths as parameters - see tracing and bpf for examples.

Personally for ChromeOS, we need only the per-task interface. Considering
that the second argument of this prctl is a command, I don't see why we
cannot add a new command PR_SCHED_CORE_CGROUP_SHARE to do what Tejun is
saying (in the future).

In order to not block ChromeOS and other "per-task interface" usecases, I
suggest we defer the CGroup interface to a later time (whether that ends up
being through prctl or the CGroups FS way that Tejun dislikes) and move
forward with the per-task interface only, initially.

Peter, any thoughts on this?

thanks,

- Joel


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-05 Thread Joel Fernandes
Hi TJ, Peter,

On Sun, Apr 4, 2021 at 7:39 PM Tejun Heo  wrote:
>
> cc'ing Michal and Christian who've been spending some time on cgroup
> interface issues recently and Li Zefan for cpuset.
>
> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1.
> > It is however changed such that you can have nested tags. Then for any
> > given task, the first parent with a cookie is the effective one. The
> > rationale is that this way you can delegate subtrees and still allow them
> > some control over grouping.
>
> I find it difficult to like the proposed interface from the name (the term
> "core" is really confusing given how the word tends to be used internally)
> to the semantics (it isn't like anything else) and even the functionality
> (we're gonna have fixed processors at some point, right?).
>
> Here are some preliminary thoughts:
>
> * Are both prctl and cgroup based interfaces really necessary? I could be
>   being naive but given that we're (hopefully) working around hardware
>   deficiencies which will go away in time, I think there's a strong case for
>   minimizing at least the interface to the bare minimum.

I don't think these issues are going away as there are constantly new
exploits related to SMT that are coming out. Further, core scheduling
is not only for SMT - there are other usecases as well (such as VM
performance by preventing vCPU threads from core-sharing).

>
>   Given how cgroups are set up (membership operations happening only for
>   seeding, especially with the new clone interface), it isn't too difficult
>   to synchronize process tree and cgroup hierarchy where it matters - ie.
>   given the right per-process level interface, restricting configuration for
>   a cgroup sub-hierarchy may not need any cgroup involvement at all. This
>   also nicely gets rid of the interaction between prctl and cgroup bits.
> * If we *have* to have cgroup interface, I wonder whether this would fit a
>   lot better as a part of cpuset. If you squint just right, this can be
>   viewed as some dynamic form of cpuset. Implementation-wise, it probably
>   won't integrate with the rest but I think the feature will be less jarring
>   as a part of cpuset, which already is a bit of kitchensink anyway.

I think both interfaces are important, for different reasons. Could you
take a look at the initial thread I started a few months ago? I tried to
elaborate on the usecases in detail:
http://lore.kernel.org/r/20200822030155.ga414...@google.com

Also, in ChromeOS we can't use CGroups for this purpose: the CGroup
hierarchy does not fit well with the threads we are tagging. We also use
CGroup v1, and since CGroups cannot overlap, doing it that way would be
cumbersome if not impossible. That said, a CGroup core-scheduling interface
is still useful for people using containers who want to core-schedule each
container separately (+Hao Luo can elaborate more on that, but I did
describe it in the link above).

> > The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> > needs to ensure continuity between ss->can_attach() and ss->attach() for the
> > memory allocation. If the prctl() were allowed to interleave it might steal
> > the memory.
> >
> > Using cgroup_mutex feels icky, but is not without precedent,
> > kernel/bpf/cgroup.c does the same thing afaict.
> >
> > TJ, can you please have a look at this?
> >
> > TJ, can you please have a look at this?
>
> Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
> operations is fine. It might be worthwhile to break that out into a proper
> interface but that's the least of concerns here.
>
> Can someone point me to a realistic and concrete usage scenario for this
> feature?

Yeah, it's at http://lore.kernel.org/r/20200822030155.ga414...@google.com
as mentioned above; let me know if you need any more details about the
usecase.

About the file name, how about kernel/sched/smt.c? That definitely
provides more information than 'core_sched.c'.

Thanks,
- Joel


[PATCH resend 7/8] Documentation: Add core scheduling documentation

2021-03-24 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 461 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..0ef00edd50e6
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,460 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the
+same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, performance is not guaranteed to improve in
+every case, though it does improve for a number of real-world workloads. In
+theory, core scheduling aims to perform at least as well as when Hyper
+Threading is disabled. In practice, this is mostly the case, though not
+always: synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This means that all of the CGroup's tasks are allowed to run concurrently on a
+core's hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy: if any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set
+  this for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can
+  share a core with kernel threads and untagged system threads. For
+  this reason, if a group has ``cpu.core_tag`` of 0, it is considered
+  to be trusted.
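+
+For illustration only (not part of the kernel interface itself), tagging an
+existing CGroup from a program amounts to writing ``1`` to its
+``cpu.core_tag`` file. The path below assumes a cgroup v1 ``cpu`` controller
+mounted at ``/sys/fs/cgroup/cpu`` and an already-created group ``mygroup``::
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <unistd.h>
+
+/* Tag the group; writing "0" instead reverts to inheriting the parent. */
+int tag_cgroup(const char *cgroup_path)  /* e.g. "/sys/fs/cgroup/cpu/mygroup" */
+{
+	char path[256];
+	int fd, ret = 0;
+
+	snprintf(path, sizeof(path), "%s/cpu.core_tag", cgroup_path);
+	fd = open(path, O_WRONLY);
+	if (fd < 0)
+		return -1;
+	if (write(fd, "1", 1) != 1)
+		ret = -1;
+	close(fd);
+	return ret;
+}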
+
+prctl(2) interface
+##################
+
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for the
+creation of core scheduling groups and for the admission and removal of tasks
+from them. Permission to change the ``cookie``, and hence the core scheduling
+group it represents, is based on ``ptrace access``.
+
+::
+
+#include <sys/prctl.h>
+
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+  unsigned long arg4, unsigned long arg5);
+
+int prctl(PR_SCHED_CORE_SHARE, sub_command, pid, pid_type, 0);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+sub-command:
+
+- ``PR_SCHED_CORE_CLEAR    0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_CREATE   1  -- create a new cookie for pid``
+- ``PR_SCHED_CORE_SHARE_FROM   2  -- copy core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO 3  -- copy core_sched cookie to pid``
+
+arg3:
+``pid`` of the task for which the operation applies where ``pid == 0``
+implies current process.
+
+arg4:
+``pid_type`` for PR_SCHED_CORE_CLEAR/CREATE/SHARE_TO is an enum
+{PIDTYPE_PID=0, PIDTYPE_TGID, PIDTYPE_PGID} and determines how the target
+``pid`` should be interpreted. ``PIDTYPE_PID`` indicates that the target
+``pid`` should be treated as an individual task, ``PIDTYPE_TGID`` as a
+process or thread group, and ``PIDTYPE_PGID`` as a process group.
+
+arg5:
+MUST be equal to 0.
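+
+Example (illustrative only, error handling omitted): create a new cookie for
+the current process and all of its threads, then copy that cookie to an
+already-running task ``helper_pid`` (a hypothetical PID). The ``PIDTYPE_*``
+values are the ones listed under arg4 above and currently have to be defined
+by userspace itself, as the selftests in this series do::
+
+prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0);
+prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, helper_pid, PIDTYPE_PID, 0);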
+
+Return Value:
+::
+
+EINVAL - bad parame

[PATCH resend 3/8] sched: prctl() cookie manipulation for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: chris hyser 

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core, 'nosmt'
can do that. The value lies in exploiting all the sharing opportunities
that exist to recover possible lost performance and that requires a degree
of flexibility in the API. From a security perspective (and there are
others), the thread, process and process group distinction is an existent
hierarchal categorization of tasks that reflects many of the security
concerns about 'data sharing'. For example, protecting against
cache-snooping by a thread that can just read the memory directly isn't all
that useful. With this in mind, subcommands to CLEAR/CREATE/SHARE (TO/FROM)
provide a mechanism to create, clear and share cookies.
CLEAR/CREATE/SHARE_TO specify a target pid with enum pidtype used to
specify the scope of the targeted tasks. For example, PIDTYPE_TGID will
share the cookie with the process and all of its threads, as is typically
desired in a security scenario.

API:

prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, srcpid, 0, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, 0)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered less
useful given the choice to create a new cookie on task exec().
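
As a short illustrative sketch (not from the selftests; the PR_* values are
the ones this series adds to uapi/linux/prctl.h and PIDTYPE_TGID is the
kernel enum value, which userspace currently has to define itself):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE	60	/* from this series' prctl.h */
#define PR_SCHED_CORE_CREATE	1
#endif
#define PIDTYPE_TGID		1	/* kernel enum pid_type */

int main(void)
{
	/*
	 * Give the current process and all of its threads a fresh cookie,
	 * so they may only share a core with each other. Children forked
	 * later inherit a copy; exec() gets a new cookie (see the policies
	 * below).
	 */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0))
		perror("PR_SCHED_CORE_CREATE");
	return 0;
}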

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. EACCES indicates that a task in the target pidtype
group was not updated due to lack of permission.

In terms of interaction with the cgroup interface, task cookies are set
independently of cgroup core scheduling cookies and thus would allow use
for tasks within a container using cgroup cookies.

Current hard-coded policies are:
- a user can clear the cookie of any process they can set a cookie for.
Lack of a cookie *might* be a security issue if cookies are being used
for that.
- on fork of a parent with a cookie, both process and thread child tasks
get a copy.
- on exec a task with a cookie is given a new cookie

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 fs/exec.c|   4 +-
 include/linux/sched.h|  11 ++
 include/linux/sched/task.h   |   4 +-
 include/uapi/linux/prctl.h   |   7 ++
 kernel/sched/core.c  |  11 +-
 kernel/sched/coretag.c   | 196 ++-
 kernel/sched/sched.h |   2 +
 kernel/sys.c |   7 ++
 tools/include/uapi/linux/prctl.h |   7 ++
 9 files changed, 241 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ab0945508b50 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1807,7 +1807,9 @@ static int bprm_execve(struct linux_binprm *bprm,
if (IS_ERR(file))
goto out_unmark;
 
-   sched_exec();
+   retval = sched_exec();
+   if (retval)
+   goto out;
 
bprm->file = file;
/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 833f8d682212..075b15392a4a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2184,8 +2184,19 @@ const struct cpumask *sched_trace_rd_span(struct 
root_domain *rd);
 
 #ifdef CONFIG_SCHED_CORE
 void sched_tsk_free(struct task_struct *tsk);
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
+int sched_core_exec(void);
 #else
 #define sched_tsk_free(tsk) do { } while (0)
+static inline int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
+{
+   return 0;
+}
+
+static inline int sched_core_exec(void)
+{
+   return 0;
+}
 #endif
 
 #endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..d0f5b233f092 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct *tsk);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 667f1aed091c..e658dca88f4f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,11 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW 0
 # define SYSCALL_DISPATCH_FILTER_BLOCK 1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		60
+# 

[PATCH resend 6/8] kselftest: Add tests for core-sched interface

2021-03-24 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: chris hyser 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|   4 +-
 .../testing/selftests/sched/test_coresched.c  | 812 ++
 3 files changed, 815 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
index 6996d4654d92..a4b4a1cdcd93 100644
--- a/tools/testing/selftests/sched/.gitignore
+++ b/tools/testing/selftests/sched/.gitignore
@@ -1 +1,2 @@
 cs_prctl_test
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 10c72f14fea9..830766e12bed 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  
-Wl,-rpath=./ \
  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := test_coresched cs_prctl_test
+TEST_PROGS := test_coresched cs_prctl_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..9d47845e6f8a
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,812 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR0
+# define PR_SCHED_CORE_CREATE   1
+# define PR_SCHED_CORE_SHARE_FROM   2
+# define PR_SCHED_CORE_SHARE_TO 3
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+   printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+   printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+   if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+   }
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
   sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+   char tag_path[50] = {}, rdbuf[8] = {};
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_RDONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag,
+  rdbuf);
+   abort();
+   }
+
+   if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+   }
+}
+
+void tag_group(char *cgroup_path)
+{
+   char tag_path[50];
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_WRONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (write(tfd, "1", 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+   }
+
+   assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+   char tag_path[50];
+   int tfd;
+
+   sprintf(tag_path, 

[PATCH resend 8/8] sched: Debug bits...

2021-03-24 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 40 +++-
 kernel/sched/fair.c | 12 
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a733891dfe7d..2649efeac19f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%llu,%llu) ?< (%s/%d;%d,%llu,%llu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -292,12 +296,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
__sched_core_flip(true);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
__sched_core_flip(false);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5361,6 +5369,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %llu\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie.userspace_id);
+
rq->core_pick = NULL;
return next;
}
@@ -5455,6 +5470,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %llu\n",
+i, p->comm, p->pid,
+p->core_cookie.userspace_id);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5471,6 +5490,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %llu\n",
+max->comm, max->pid,
+max->core_cookie.userspace_id);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5492,6 +5515,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %llu\n", next->comm, next->pid,
+next->core_cookie.userspace_id);
 
/*
 * Reschedule siblings
@@ -5533,13 +5558,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%llu/0x%llu\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie.userspace_id,
+rq_i->core->core_cookie.userspace_id);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5579,6 +5612,11 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %llu\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation,
+cookie->userspace_id);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c 

[PATCH resend 5/8] sched: cgroup cookie API for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: Josh Don 

This adds the API to set/get the cookie for a given cgroup. This
interface lives at cgroup/cpu.core_tag.

The cgroup interface can be used to toggle a unique cookie value for all
descendent tasks, preventing these tasks from sharing with any others.
See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
rundown of both this and the per-task API.

Signed-off-by: Josh Don 
---
 kernel/sched/core.c|  61 ++--
 kernel/sched/coretag.c | 156 -
 kernel/sched/sched.h   |  25 +++
 3 files changed, 235 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3093cb3414c3..a733891dfe7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9328,6 +9328,8 @@ struct task_group *sched_create_group(struct task_group 
*parent)
 
alloc_uclamp_sched_group(tg, parent);
 
+   alloc_sched_core_sched_group(tg);
+
return tg;
 
 err:
@@ -9391,6 +9393,11 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+   sched_core_change_group(tsk, tg);
+#endif
+
tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9443,11 +9450,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9483,6 +9485,18 @@ static int cpu_cgroup_css_online(struct 
cgroup_subsys_state *css)
return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+   struct task_group *tg = css_tg(css);
+
+   if (tg->core_tagged) {
+   sched_core_put();
+   tg->core_tagged = 0;
+   }
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
struct task_group *tg = css_tg(css);
@@ -9517,6 +9531,25 @@ static void cpu_cgroup_fork(struct task_struct *task)
task_rq_unlock(rq, task, &rf);
 }
 
+static void cpu_cgroup_exit(struct task_struct *task)
+{
+#ifdef CONFIG_SCHED_CORE
+   /*
+* This is possible if task exit races with core sched being
+* disabled due to the task's cgroup no longer being tagged, since
+* cpu_core_tag_write_u64() will miss dying tasks.
+*/
+   if (unlikely(sched_core_enqueued(task))) {
+   struct rq *rq;
+   struct rq_flags rf;
+
+   rq = task_rq_lock(task, &rf);
+   sched_core_dequeue(rq, task);
+   task_rq_unlock(rq, task, &rf);
+   }
+#endif
+}
+
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
struct task_struct *task;
@@ -10084,6 +10117,14 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
 #endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "core_tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
@@ -10257,6 +10298,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
 #endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "core_tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
@@ -10285,10 +10334,12 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc  = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+   .css_offline= cpu_cgroup_css_offline,
.css_released   = cpu_cgroup_css_released,
.css_free   = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
.fork   = cpu_cgroup_fork,
+   .exit   = cpu_cgroup_exit,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 550f4975eea2..1498790bc76c 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -96,9 +96,19 @@ static void __sched_core_set_task_cookie(struct 
sched_core_cookie *cookie,
 static void __sched_core_set_group_cookie(struct sched_core_cookie *cookie,
  unsigned long val)
 {
+   struct 

[PATCH resend 1/8] sched: migration changes for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task may be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 29 ++
 kernel/sched/sched.h | 73 
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a03564398605..12030b73a032 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5877,11 +5877,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -5967,9 +5971,10 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
 
return -1;
@@ -6041,7 +6046,7 @@ static int select_idle_core(struct task_struct *p, int 
core, struct cpumask *cpu
int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
@@ -6079,7 +6084,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
 {
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6132,7 +6137,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
} else {
if (!--nr)
return -1;
-   idle_cpu = __select_idle_cpu(cpu);
+   idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
}
@@ -7473,6 +7478,14 @@ static int task_hot(struct task_struct *p, struct lb_env 
*env)
 
if (sysctl_sched_migration_cost == -1)
return 1;
+
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 1;
+
if (sysctl_sched_migration_cost == 0)
return 0;
 
@@ -8834,6 +8847,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 80abbc0af680..12edfb8f6994 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,8 +1128,10 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 #endif
 }
 
+struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1163,6 +1165,61 @@ static inline raw_spinlock_t *__rq_lockp(struct rq

[PATCH resend 2/8] sched: core scheduling tagging infrastructure

2021-03-24 Thread Joel Fernandes (Google)
From: Josh Don 

A single unsigned long is insufficient as a cookie value for core
scheduling. We will minimally have cookie values for a per-task and a
per-group interface, which must be combined into an overall cookie.

This patch adds the infrastructure necessary for setting task and group
cookies. Namely, it reworks the core_cookie into a struct, and provides
interfaces for setting the task and group cookies, as well as other
operations (e.g. compare()). Subsequent patches will use these hooks to
provide an API for setting these cookies.

One important property of this interface is that neither the per-task
nor the per-cgroup setting overrides the other. For example, if two
tasks are in different cgroups, and one or both of the cgroups is tagged
using the per-cgroup interface, then these tasks cannot share, even if
they use the per-task interface to attempt to share with one another.

Core scheduler has extra overhead.  Enable it only for machines with
more than one SMT hardware thread.
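
To make the compound-cookie rule concrete, here is a tiny userspace model of
the matching logic (illustrative only; the in-kernel struct in this series
additionally carries a userspace_id, and group_cookie depends on
CONFIG_CGROUP_SCHED):

#include <stdio.h>

/* Model: two tasks may share a core only if *all* cookie fields match. */
struct cookie {
	unsigned long task_cookie;   /* per-task (prctl) interface */
	unsigned long group_cookie;  /* per-cgroup (cpu.core_tag) interface */
};

static int cookies_match(const struct cookie *a, const struct cookie *b)
{
	return a->task_cookie == b->task_cookie &&
	       a->group_cookie == b->group_cookie;
}

int main(void)
{
	struct cookie t1 = { .task_cookie = 0x1, .group_cookie = 0x0 };
	struct cookie t2 = { .task_cookie = 0x1, .group_cookie = 0x2 };

	/* Same task cookie, but t2's cgroup is tagged: they may NOT share. */
	printf("may share a core: %s\n", cookies_match(&t1, &t2) ? "yes" : "no");
	return 0;
}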

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Signed-off-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
---
 include/linux/sched.h  |  24 +++-
 kernel/fork.c  |   1 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 100 ++---
 kernel/sched/coretag.c | 245 +
 kernel/sched/debug.c   |   4 +
 kernel/sched/sched.h   |  57 --
 7 files changed, 384 insertions(+), 48 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d91ff1d3a30..833f8d682212 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -645,6 +645,22 @@ struct kmap_ctrl {
 #endif
 };
 
+#ifdef CONFIG_SCHED_CORE
+struct sched_core_cookie {
+   unsigned long task_cookie;
+#ifdef CONFIG_CGROUP_SCHED
+   unsigned long group_cookie;
+#endif
+
+   /*
+* A u64 representation of the cookie used only for display to
+* userspace. We avoid exposing the actual cookie contents, which
+* are kernel pointers.
+*/
+   u64 userspace_id;
+};
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -703,7 +719,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
-   unsigned long   core_cookie;
+   struct sched_core_cookiecore_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2166,4 +2182,10 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 54cc905e5fe0..cbe461105b10 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -737,6 +737,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53d742ed6432..1b07687c53d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -123,11 +123,13 @@ static inline bool prio_less(struct task_struct *a, 
struct task_struct *b, bool
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
 {
-   if (a->core_cookie < b->core_cookie)
-   return true;
+   int cmp = sched_core_cookie_cmp(&a->core_cookie, &b->core_cookie);
 
-   if (a->core_cookie > b->core_cookie)
-   return false;
+   if (cmp < 0)
+   return true; /* a < b */
+
+   if (cmp > 0)
+   return false; /* a > b */
 
/* flip prio, so high prio is leftmost */
if (prio_less(b, a, task_rq(a)->core->core_forceidle))
@@ -146,41 +148,49 @@ static inline bool rb_sched_core_less(struct rb_node *a, 
const struct rb_node *b
 static inline int rb_sched_core_cmp(const void *key, const struct rb_node 
*node)
 {
const struct task_struct *p = __node_2_sc(node);
-   unsigned long cookie = (unsigned long)key;
+   const struct sched_core_cookie *cookie = key;
+   int cmp = sched_core_cookie_cm

[PATCH resend 4/8] kselftest: Add test for core sched prctl interface

2021-03-24 Thread Joel Fernandes (Google)
From: chris hyser 

Provides a selftest and examples of using the interface.

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 tools/testing/selftests/sched/cs_prctl_test.c | 370 ++
 4 files changed, 386 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..6996d4654d92
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..10c72f14fea9
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index ..03581e180e31
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser 
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see .
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+   return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE		60
+# define PR_SCHED_CORE_CLEAR   0
+# define PR_SCHED_CORE_CREATE  1
+# define PR_SCHED_CORE_SHARE_FROM  2
+# define PR_SCHED_CORE_SHARE_TO3
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"options:\n"
+"  -P  : number of processes to create.\n"
+"  -T  : number of threads per process to create.\n"
+"  -d  : delay time to keep tasks alive.\n"
+"  -k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+ unsigned long arg5)
+{
+   int res;
+
+   res = prctl(option, arg2, arg3, arg4, arg5);
+   printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, 
(long)arg3,
+  (long)arg4, arg5);
+   return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+   printf("(%s:%d) - ", fn, ln);
+   perror(msg);
+   exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+   puts(USAGE);
+   puts(msg);
+   putchar('\n');
+   exit(rc);
+}
+
+static unsigned long get_cs_cookie(int pid)
+{
+   char buf[4096];
+   char fn[512];
+   FILE *inf;
+   char *c;
+   int i;
+
+   if (pid == 0)
+   pid = getpid();
+   snprintf(fn, 512, "/proc/%d/sched", pid);
+
+   inf = fopen(fn, "r");
+   if (!inf)
+   return -2UL;
+
+   while (fgets(buf, 4096, inf)) {
+   if (!strncmp(buf, "core_cookie", 11)) 

[PATCH resend 0/8] Core sched remaining patches rebased

2021-03-24 Thread Joel Fernandes (Google)
 Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===
- Load balancing/Migration fixes for core scheduling.
  With v6, Load balancing is partially coresched aware, but has some
  issues w.r.t process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Josh Don (2):
sched: core scheduling tagging infrastructure
sched: cgroup cookie API for core scheduling

chris hyser (2):
sched: prctl() cookie manipulation for core scheduling
kselftest: Add test for core sched prctl interface

.../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
fs/exec.c |   4 +-
include/linux/sched.h |  35 +-
include/linux/sched/task.h|   4 +-
include/uapi/linux/prctl.h|   7 +
kernel/fork.c |   1 +
kernel/sched/Makefile |   1 +
kernel/sched/core.c   | 212 -
kernel/sched/coretag.c| 587 +
kernel/sched/debug.c  |   4 +
kernel/sched/fair.c   |  41 +-
kernel/sched/sched.h  | 151 +++-
kernel/sys.c  |   7 +
tools/include/uapi/linux/prctl.h  |   7 +
tools/testing/selftests/sched/.gitignore  |   2 +
tools/testing/selftests/sched/Makefile|  14 +
tools/testing/selftests/sched/config  |   1 +
tools/testing/selftests/sched/cs_prctl_test.c | 370 
.../testing/selftests/sched/test_coresched.c  | 812 ++
20 files changed, 2659 insertions(+), 62 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.31.0.291.g576ba9dcdaf-goog



Re: [PATCH 0/6] Core scheduling remaining patches

2021-03-22 Thread Joel Fernandes
On Sat, Mar 20, 2021 at 04:40:20PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 04:32:47PM -0400, Joel Fernandes (Google) wrote:
> > Enclosed is interface related core scheduling patches and one for migration.
> > The main core scheduling patches were already pulled in by Peter with these
> > bits left.
> 
> Funny thing, they don't appear to apply to my sched/core-sched branch...
> :/ Either I really shouldn't be working weekends or something went
> wobbly on your end...

Yeah sorry. It is based on a slightly older snapshot of your queue.git
sched/core-sched branch. We will rebase again and resend ASAP.

Thanks!

 - Joel


[PATCH 1/6] sched: migration changes for core scheduling

2021-03-19 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 72 
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f90765f7fd..fddd7c44bbf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d563b3f97789..877f77044b39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1125,6 +1125,7 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled)

[PATCH 2/6] sched: tagging interface for core scheduling

2021-03-19 Thread Joel Fernandes (Google)
From: Josh Don 

Adds per-task and per-cgroup interfaces for specifying which tasks can
co-execute on adjacent SMT hyperthreads via core scheduling.

The per-task interface hooks are implemented here, but are not currently
used. The following patch adds a prctl interface which then takes
advantage of these.

The cgroup interface can be used to toggle a unique cookie value for all
descendent tasks, preventing these tasks from sharing with any others.
See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
rundown.

One important property of this interface is that neither the per-task
nor the per-cgroup setting overrides the other. For example, if two
tasks are in different cgroups, and one or both of the cgroups is tagged
using the per-cgroup interface, then these tasks cannot share, even if
they use the per-task interface to attempt to share with one another.

The above is implemented by making the overall core scheduling cookie a
compound structure, containing both a task-level cookie and a
group-level cookie. Two tasks will only be allowed to share if all
fields of their respective cookies match.

Core scheduler has extra overhead.  Enable it only for machines with
more than one SMT hardware thread.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Signed-off-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
---
 include/linux/sched.h  |  20 ++-
 kernel/fork.c  |   1 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 172 +-
 kernel/sched/coretag.c | 397 +
 kernel/sched/debug.c   |   4 +
 kernel/sched/sched.h   |  85 +++--
 7 files changed, 619 insertions(+), 61 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344432130b8f..9031aa8fee5b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,20 @@ struct wake_q_node {
struct wake_q_node *next;
 };
 
+#ifdef CONFIG_SCHED_CORE
+struct sched_core_cookie {
+   unsigned long task_cookie;
+   unsigned long group_cookie;
+
+   /* A u64 representation of the cookie used only for display to
+* userspace. We avoid exposing the actual cookie contents, which
+* are kernel pointers.
+*/
+   u64 userspace_id;
+};
+#endif
+
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -687,7 +701,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
-   unsigned long   core_cookie;
+   struct sched_core_cookiecore_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2076,7 +2090,6 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
-#ifdef CONFIG_SCHED_CORE
 enum ht_protect_ctx {
HT_PROTECT_SYSCALL,
HT_PROTECT_IRQ,
@@ -2084,15 +2097,18 @@ enum ht_protect_ctx {
HT_PROTECT_FROM_IDLE
 };
 
+#ifdef CONFIG_SCHED_CORE
 void sched_core_unsafe_enter(enum ht_protect_ctx ctx);
 void sched_core_unsafe_exit(enum ht_protect_ctx ctx);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(enum ht_protect_ctx ctx);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 073047b13126..2e3024a6f6e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -164,11 +164,13 @@ static inline bool prio_less(struct task_struct *a, 
struct task_struct *b, bool
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
 {
-   if (a->core_cook

[PATCH 3/6] sched: prctl() cookie manipulation for core scheduling.

2021-03-19 Thread Joel Fernandes (Google)
From: chris hyser 

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core, 'nosmt'
can do that. The value lies in exploiting all the sharing opportunities
that exist to recover possible lost performance and that requires a degree
of flexibility in the API. From a security perspective (and there are
others), the thread, process and process group distinction is an existent
hierarchal categorization of tasks that reflects many of the security
concerns about 'data sharing'. For example, protecting against
cache-snooping by a thread that can just read the memory directly isn't all
that useful. With this in mind, subcommands to CLEAR/CREATE/SHARE (TO/FROM)
provide a mechanism to create, clear and share cookies.
CLEAR/CREATE/SHARE_TO specify a target pid with enum pidtype used to
specify the scope of the targeted tasks. For example, PIDTYPE_TGID will
share the cookie with the process and all of its threads, as is typically
desired in a security scenario.

API:

prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, srcpid, 0, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, 0)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered less
useful given the choice to create a new cookie on task exec().

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access
to tgtpid/srcpid. EACCES indicates that a task in the target pidtype group was
not updated due to lack of permission.

In terms of interaction with the cgroup interface, task cookies are set
independently of cgroup core scheduling cookies and thus would allow use
for tasks within a container using cgroup cookies.

Current hard-coded policies are:
- a user can clear the cookie of any process they can set a cookie for.
Lack of a cookie *might* be a security issue if cookies are being used
for that.
- on fork of a parent with a cookie, both process and thread child tasks
get a copy.
- on exec a task with a cookie is given a new cookie

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 include/linux/sched.h|   7 ++
 include/linux/sched/task.h   |   4 +-
 include/uapi/linux/prctl.h   |   7 ++
 kernel/sched/core.c  |  11 +-
 kernel/sched/coretag.c   | 197 +--
 kernel/sched/sched.h |   2 +
 kernel/sys.c |   7 ++
 tools/include/uapi/linux/prctl.h |   7 ++
 8 files changed, 230 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9031aa8fee5b..6ccbdbf7048b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,13 +2102,20 @@ void sched_core_unsafe_enter(enum ht_protect_ctx ctx);
 void sched_core_unsafe_exit(enum ht_protect_ctx ctx);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(enum ht_protect_ctx ctx);
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
 void sched_tsk_free(struct task_struct *tsk);
+int sched_core_exec(void);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+static inline int sched_core_share_pid(unsigned long flags, pid_t pid, enum 
pid_type type)
+{
+   return 0;
+}
 #define sched_tsk_free(tsk) do { } while (0)
+static inline int sched_core_exec(void) { return 0; }
 #endif
 
 #endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 85fb2f34c59b..033033ed641e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct *tsk);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..40c7241f5fcb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,11 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+# define PR_SCHED_CORE_CLEAR   0 /* clear core_sched 

[PATCH 4/6] kselftest: Add tests for core-sched interface

2021-03-19 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 tools/testing/selftests/sched/cs_prctl_test.c | 372 
 .../testing/selftests/sched/test_coresched.c  | 812 ++
 5 files changed, 1200 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..830766e12bed
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched cs_prctl_test
+TEST_PROGS := test_coresched cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c 
b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index ..9e51874533c8
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser 
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+   return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE59
+# define PR_SCHED_CORE_CLEAR   0
+# define PR_SCHED_CORE_CREATE  1
+# define PR_SCHED_CORE_SHARE_FROM  2
+# define PR_SCHED_CORE_SHARE_TO3
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"options:\n"
+"  -P  : number of processes to create.\n"
+"  -T  : number of threads per process to create.\n"
+"  -d  : delay time to keep tasks alive.\n"
+"  -k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | 
CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned 
long arg4,
+ unsigned long arg5)
+{
+   int res;
+
+   res = prctl(option, arg2, arg3, arg4, arg5);
+   printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, 
(long)arg3,
+  (long)arg4, arg5);
+   return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+   printf("(%s:%d) - ", fn, ln);
+   perror(msg);
+   exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+   puts(USAGE);
+   puts(msg);
+   putchar('\n');
+   exit(rc);
+}
+
+static unsigned long get_cs_cookie

[PATCH 5/6] Documentation: Add core scheduling documentation

2021-03-19 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 461 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..50042e79709d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,460 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks does not trust another), or for performance usecases (some
+workloads may benefit from running on the same core because they do not need
+the same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that only tasks that trust each other share a core. This increase
+in core sharing can improve performance; however, it is not guaranteed that
+performance will always improve, though that is seen to be the case with a
+number of real-world workloads. In theory, core scheduling aims to perform at
+least as well as when Hyper Threading is disabled. In practice, this is mostly
+the case, though not always, since synchronizing scheduling decisions across
+2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file tags all tasks in the group. This allows all of
+the CGroup's tasks to run concurrently on a core's hyperthreads (also called
+siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set 
this
+  for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this 
reason,
+  if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+prctl(2) interface
+##################
+
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for
+creating core scheduling groups and for admitting tasks to and removing them
+from such groups. Permission to change the ``cookie``, and hence the core
+scheduling group it represents, is based on ``ptrace access``.
+
+::
+
+#include <sys/prctl.h>
+
+int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned 
long arg4, unsigned long arg5);
+
+int prctl(PR_SCHED_CORE_SHARE, sub_command, pid, pid_type, 0);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+sub-command:
+
+- ``PR_SCHED_CORE_CLEAR0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_CREATE   1  -- create a new cookie for pid``
+- ``PR_SCHED_CORE_SHARE_FROM   2  -- copy core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO 3  -- copy core_sched cookie to pid``
+
+arg3:
+``pid`` of the task for which the operation applies where ``pid == 0``
+implies current process.
+
+arg4:
+``pid_type`` for PR_SCHED_CORE_CLEAR/CREATE/SHARE_TO is an enum
+{PIDTYPE_PID=0, PIDTYPE_TGID, PIDTYPE_PGID} and determines how the target
+``pid`` should be interpreted. ``PIDTYPE_PID`` indicates that the target
+``pid`` should be treated as an individual task, ``PIDTYPE_TGID`` as a process
+or thread group, and ``PIDTYPE_PGID`` as a process group.
+
+arg5:
+MUST be equal to 0.
+
+Return Value:
+::
+
+EINVAL - bad parame

[PATCH 6/6] sched: Debug bits...

2021-03-19 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 40 +++-
 kernel/sched/fair.c |  9 +
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62e8ad5ce58..58cca96ba93d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -147,6 +147,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -312,12 +316,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5503,6 +5511,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %llu\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie.userspace_id);
+
rq->core_pick = NULL;
return next;
}
@@ -5597,6 +5612,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %llu\n",
+i, p->comm, p->pid,
+p->core_cookie.userspace_id);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5613,6 +5632,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %llu\n",
+max->comm, max->pid,
+max->core_cookie.userspace_id);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5634,6 +5657,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %llu\n", next->comm, next->pid,
+next->core_cookie.userspace_id);
 
/*
 * Reschedule siblings
@@ -5675,13 +5700,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. %s/%d/0x%llu/0x%llu\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie.userspace_id,
+rq_i->core->core_cookie.userspace_id);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5721,6 +5754,11 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %llu\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation,
+cookie->userspace_id);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);

[PATCH 0/6] Core scheduling remaining patches

2021-03-19 Thread Joel Fernandes (Google)
From: Joel Fernandes 

Core-Scheduling
===
Enclosed are the interface-related core scheduling patches and one for migration.
The main core scheduling patches were already pulled in by Peter with these
bits left.

Main changes are the simplification of the core cookie scheme,
new prctl code, and other misc fixes based on Peter's feedback.

These remaining patches were worked on mostly by Josh Don and Chris Hyser.

Introduction of feature
===
Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g., hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). Core scheduling (as of v7) mitigates
user-space to user-space attacks and user-to-kernel attacks when one of
the siblings enters the kernel via interrupts or system calls.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.

Both a CGroup and Per-task interface via prctl(2) are provided for configuring
core sharing. More details are provided in the documentation patch. Kselftests are
provided to verify the correctness/rules of the interface.
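
As a rough illustration of the CGroup side (a sketch only, not part of this
series; the cgroup path is just an example), tagging a group boils down to
writing "1" into its cpu.core_tag file:

    #include <fcntl.h>
    #include <unistd.h>

    /* Tag a CPU-controller cgroup so that its tasks may share a core. */
    static int tag_cgroup_example(const char *core_tag_path)
    {
            /* e.g. "/sys/fs/cgroup/cpu/trusted/cpu.core_tag" (example path) */
            int fd = open(core_tag_path, O_WRONLY);

            if (fd < 0)
                    return -1;
            if (write(fd, "1", 1) != 1) {
                    close(fd);
                    return -1;
            }
            return close(fd);
    }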

Testing
===
ChromeOS testing shows a 300% improvement in keypress latency for a Google
Docs keypress test run together with a Google Hangouts call (the maximum
keypress latency drops from 150ms to 50ms).

Julien: TPCC tests showed improvements with core-scheduling as below. With 
kernel
protection enabled, it does not show any regression. Possibly ASI will improve
the performance for those who choose kernel protection (can be controlled 
through
ht_protect kernel command line option).
                                average     stdev        diff
baseline (SMT on)               1197.272    44.78312824
core sched (   kernel protect)  412.9895    45.42734343  -65.51%
core sched (no kernel protect)  686.6515    71.77756931  -42.65%
nosmt                           408.667     39.39042872  -65.87%
(Note these results are from v8).

Vineeth tested sysbench and does not see any regressions.
Hong and Aubrey tested v9 and see results similar to v8. There is a known issue
with uperf that does regress. This appears to be because of ksoftirq heavily
contending with other tasks on the core. The consensus is this can be improved
in the future.

Other changes:
- Fixed breaking of coresched= option patch on !SCHED_CORE builds.
- Trivial commit message changes.

Changes in v10
==
- migration code changes from Aubrey.
- dropped patches merged.
- interface changes from Josh and Chris.

Changes in v9
=
- Note that the vruntime snapshot change is written in 2 patches to show the
  progression of the idea and prevent merge conflicts:
sched/fair: Snapshot the min_vruntime of CPUs on force idle
sched: Improve snapshotting of min_vruntime for CGroups
  Same with the RT priority inversion change:
sched: Fix priority inversion of cookied task with sibling
sched: Improve snapshotting of min_vruntime for CGroups
- Disable coresched on certain AMD HW.

Changes in v8
=
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving

[tip: core/rcu] rcu/tree: Make rcu_do_batch count how many callbacks were executed

2021-02-15 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 6bc335828056f3b301a3deadda782de4e8f0db08
Gitweb:
https://git.kernel.org/tip/6bc335828056f3b301a3deadda782de4e8f0db08
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 03 Nov 2020 09:25:57 -05:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 13:22:12 -08:00

rcu/tree: Make rcu_do_batch count how many callbacks were executed

The rcu_do_batch() function extracts the ready-to-invoke callbacks
from the rcu_segcblist located in the ->cblist field of the current
CPU's rcu_data structure.  These callbacks are first moved to a local
(unsegmented) rcu_cblist.  The rcu_do_batch() function then uses this
rcu_cblist's ->len field to count how many CBs it has invoked, but it
does so by counting that field down from zero.  Finally, this function
negates the value in this ->len field (resulting in a positive number)
and subtracts the result from the ->len field of the current CPU's
->cblist field.

Except that it is sometimes necessary for rcu_do_batch() to stop invoking
callbacks mid-stream, despite there being more ready to invoke, for
example, if a high-priority task wakes up.  In this case the remaining
not-yet-invoked callbacks are requeued back onto the CPU's ->cblist,
but remain in the ready-to-invoke segment of that list.  As above, the
negative of the local rcu_cblist's ->len field is still subtracted from
the ->len field of the current CPU's ->cblist field.

The design of counting down from 0 is confusing and error-prone, plus
use of a positive count will make it easier to provide a uniform and
consistent API to deal with the per-segment counts that are added
later in this series.  For example, rcu_segcblist_extract_done_cbs()
can unconditionally populate the resulting unsegmented list's ->len
field during extraction.

This commit therefore explicitly counts how many callbacks were executed
in rcu_do_batch() itself, counting up from zero, and then uses that
to update the per-CPU segcb list's ->len field, without relying on the
downcounting of rcl->len from zero.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c |  2 +-
 kernel/rcu/rcu_segcblist.h |  1 +
 kernel/rcu/tree.c  | 11 +--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 2d2a6b6..bb246d8 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -95,7 +95,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
  */
-static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
smp_mb__before_atomic(); /* Up to the caller! */
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 492262b..1d2d614 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -76,6 +76,7 @@ static inline bool rcu_segcblist_restempty(struct 
rcu_segcblist *rsclp, int seg)
 }
 
 void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v);
 void rcu_segcblist_init(struct rcu_segcblist *rsclp);
 void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..cc6f379 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2434,7 +2434,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
-   long bl, count;
+   long bl, count = 0;
long pending, tlimit = 0;
 
/* If no callbacks are ready, just return. */
@@ -2479,6 +2479,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
+   count++;
debug_rcu_head_unqueue(rhp);
 
rcu_lock_acquire(&rcu_callback_map);
@@ -2492,15 +2493,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/*
 * Stop only if limit reached and CPU has something to do.
-* Note: The rcl structure counts down from zero.
 */
-   if (-rcl.len >= bl && !offloaded &&
+   if (count >= bl && !offloaded &&
(need_resched() ||
 (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
if (unlikely(tlimit)) {
  

[tip: core/rcu] rcu/segcblist: Add additional comments to explain smp_mb()

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: c2e13112e830c06825339cbadf0b3bc2bdb9a716
Gitweb:
https://git.kernel.org/tip/c2e13112e830c06825339cbadf0b3bc2bdb9a716
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 03 Nov 2020 09:26:03 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:23:23 -08:00

rcu/segcblist: Add additional comments to explain smp_mb()

One counter-intuitive property of RCU is the fact that full memory
barriers are needed both before and after updates to the full
(non-segmented) length.  This patch therefore assists the
reader's intuition by adding appropriate comments.

[ paulmck:  Wordsmithing. ]
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 68 ++---
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index bb246d8..3cff800 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,17 +94,77 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * So why on earth is a memory barrier required both before and after
+ * the update to the ->len field???
+ *
+ * The reason is that rcu_barrier() locklessly samples each CPU's ->len
+ * field, and if a given CPU's field is zero, avoids IPIing that CPU.
+ * This can of course race with both queuing and invoking of callbacks.
+ * Failing to correctly handle either of these races could result in
+ * rcu_barrier() failing to IPI a CPU that actually had callbacks queued
+ * which rcu_barrier() was obligated to wait on.  And if rcu_barrier()
+ * failed to wait on such a callback, unloading certain kernel modules
+ * would result in calls to functions whose code was no longer present in
+ * the kernel, for but one example.
+ *
+ * Therefore, ->len transitions from 1->0 and 0->1 have to be carefully
+ * ordered with respect with both list modifications and the rcu_barrier().
+ *
+ * The queuing case is CASE 1 and the invoking case is CASE 2.
+ *
+ * CASE 1: Suppose that CPU 0 has no callbacks queued, but invokes
+ * call_rcu() just as CPU 1 invokes rcu_barrier().  CPU 0's ->len field
+ * will transition from 0->1, which is one of the transitions that must
+ * be handled carefully.  Without the full memory barriers after the ->len
+ * update and at the beginning of rcu_barrier(), the following could happen:
+ *
+ *                    CPU 0                            CPU 1
+ *
+ * call_rcu().
+ *                                      rcu_barrier() sees ->len as 0.
+ * set ->len = 1.
+ *                                      rcu_barrier() does nothing.
+ *                                      module is unloaded.
+ * callback invokes unloaded function!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0 will
+ * have unambiguously preceded the return from the racing call_rcu(), which
+ * means that this call_rcu() invocation is OK to not wait on.  After all,
+ * you are supposed to make sure that any problematic call_rcu() invocations
+ * happen before the rcu_barrier().
+ *
+ *
+ * CASE 2: Suppose that CPU 0 is invoking its last callback just as
+ * CPU 1 invokes rcu_barrier().  CPU 0's ->len field will transition from
+ * 1->0, which is one of the transitions that must be handled carefully.
+ * Without the full memory barriers before the ->len update and at the
+ * end of rcu_barrier(), the following could happen:
+ *
+ *                    CPU 0                            CPU 1
+ *
+ * start invoking last callback
+ * set ->len = 0 (reordered)
+ *                                      rcu_barrier() sees ->len as 0
+ *                                      rcu_barrier() does nothing.
+ *                                      module is unloaded
+ * callback executing after unloaded!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0
+ * will be fully ordered after the completion of the callback function,
+ * so that the module unloading operation is completely safe.
+ *
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); // Read header comment above.
atomic_long_add(v, &rsclp->len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  // Read header comment above.
 #else
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); // Read header comment above.
WRITE_ONCE(rsclp->len, rsclp->len + v);
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); // Read header comment above.
 #endif
 }
 


[tip: core/rcu] rcu/segcblist: Add counters to segcblist datastructure

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ae5c2341ed3987bd434ed495bd4f3d8b2bc3e623
Gitweb:
https://git.kernel.org/tip/ae5c2341ed3987bd434ed495bd4f3d8b2bc3e623
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 23 Sep 2020 11:22:09 -04:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/segcblist: Add counters to segcblist datastructure

Add counting of segment lengths of segmented callback list.

This will be useful for a number of things such as knowing how big the
ready-to-execute segment has gotten. The immediate benefit is the ability
to trace how the callbacks in the segmented callback list change.

Also, this patch removes hacks related to using donecbs's ->len field as a
temporary variable to save the segmented callback list's length. This cannot be
done anymore and is not needed.

Also fix SRCU:
The negative counting of the unsegmented list cannot be used to adjust
the segmented one. To fix this, sample the unsegmented length in
advance, and use it after CB execution to adjust the segmented list's
length.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h |   1 +-
 kernel/rcu/rcu_segcblist.c| 120 +
 kernel/rcu/rcu_segcblist.h|   2 +-
 kernel/rcu/srcutree.c |   5 +-
 4 files changed, 82 insertions(+), 46 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index b36afe7..6c01f09 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -72,6 +72,7 @@ struct rcu_segcblist {
 #else
long len;
 #endif
+   long seglen[RCU_CBLIST_NSEGS];
u8 enabled;
u8 offloaded;
 };
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 3cff800..804 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -7,10 +7,10 @@
  * Authors: Paul E. McKenney 
  */
 
-#include 
-#include 
+#include 
 #include 
-#include 
+#include 
+#include 
 
 #include "rcu_segcblist.h"
 
@@ -88,6 +88,46 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
 #endif
 }
 
+/* Get the length of a segment of the rcu_segcblist structure. */
+static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   return READ_ONCE(rsclp->seglen[seg]);
+}
+
+/* Set the length of a segment of the rcu_segcblist structure. */
+static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], v);
+}
+
+/* Increase the numeric length of a segment by a specified amount. */
+static void rcu_segcblist_add_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], rsclp->seglen[seg] + v);
+}
+
+/* Move from's segment length to to's segment. */
+static void rcu_segcblist_move_seglen(struct rcu_segcblist *rsclp, int from, 
int to)
+{
+   long len;
+
+   if (from == to)
+   return;
+
+   len = rcu_segcblist_get_seglen(rsclp, from);
+   if (!len)
+   return;
+
+   rcu_segcblist_add_seglen(rsclp, to, len);
+   rcu_segcblist_set_seglen(rsclp, from, 0);
+}
+
+/* Increment segment's length. */
+static void rcu_segcblist_inc_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   rcu_segcblist_add_seglen(rsclp, seg, 1);
+}
+
 /*
  * Increase the numeric length of an rcu_segcblist structure by the
  * specified amount, which can be negative.  This can cause the ->len
@@ -180,26 +220,6 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
 }
 
 /*
- * Exchange the numeric length of the specified rcu_segcblist structure
- * with the specified value.  This can cause the ->len field to disagree
- * with the actual number of callbacks on the structure.  This exchange is
- * fully ordered with respect to the callers accesses both before and after.
- */
-static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
-{
-#ifdef CONFIG_RCU_NOCB_CPU
-   return atomic_long_xchg(&rsclp->len, v);
-#else
-   long ret = rsclp->len;
-
-   smp_mb(); /* Up to the caller! */
-   WRITE_ONCE(rsclp->len, v);
-   smp_mb(); /* Up to the caller! */
-   return ret;
-#endif
-}
-
-/*
  * Initialize an rcu_segcblist structure.
  */
 void rcu_segcblist_init(struct rcu_segcblist *rsclp)
@@ -209,8 +229,10 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
rsclp->head = NULL;
-   for (i = 0; i < RCU_CBLIST_NSEGS; i++)
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
rsclp->tails[i] = &rsclp->head;
+   rcu_segcblist_set_seglen(rsclp, i, 0);
+   }
rcu_segcblist_set_len(rs

[tip: core/rcu] rcu/segcblist: Add debug checks for segment lengths

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: b4e6039e8af8c20dfbbdfcaebfcbd7c9d9ffe713
Gitweb:
https://git.kernel.org/tip/b4e6039e8af8c20dfbbdfcaebfcbd7c9d9ffe713
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 18 Nov 2020 11:15:41 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/segcblist: Add debug checks for segment lengths

This commit adds debug checks near the end of rcu_do_batch() that emit
warnings if an empty rcu_segcblist structure has non-zero segment counts,
or, conversely, if a non-empty structure has all-zero segment counts.

Signed-off-by: Joel Fernandes (Google) 
[ paulmck: Fix queue/segment-length checks. ]
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  8 ++--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 1e80a0a..89e0dff 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by summing seglen. */
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9f..18e101d 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by summing seglen. */
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6bf269c..8086c04 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2434,6 +2434,7 @@ int rcutree_dead_cpu(unsigned int cpu)
 static void rcu_do_batch(struct rcu_data *rdp)
 {
int div;
+   bool __maybe_unused empty;
unsigned long flags;
const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
@@ -2548,9 +2549,12 @@ static void rcu_do_batch(struct rcu_data *rdp)
 * The following usually indicates a double call_rcu().  To track
 * this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.
 */
-   WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
+   empty = rcu_segcblist_empty(&rdp->cblist);
+   WARN_ON_ONCE(count == 0 && !empty);
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-count != 0 && rcu_segcblist_empty(&rdp->cblist));
+count != 0 && empty);
+   WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
+   WARN_ON_ONCE(!empty && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 


[tip: core/rcu] rcu/tree: segcblist: Remove redundant smp_mb()s

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 68804cf1c905ce227e4e1d0bc252c216811c59fd
Gitweb:
https://git.kernel.org/tip/68804cf1c905ce227e4e1d0bc252c216811c59fd
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 14 Oct 2020 18:21:53 -04:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/tree: segcblist: Remove redundant smp_mb()s

The full memory barriers in rcu_segcblist_enqueue() and in rcu_do_batch()
are not needed because rcu_segcblist_add_len(), and thus also
rcu_segcblist_inc_len(), already includes a memory barrier *before*
and *after* the length of the list is updated.

This commit therefore removes these redundant smp_mb() invocations.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 kernel/rcu/tree.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 804..1e80a0a 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -327,7 +327,6 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
   struct rcu_head *rhp)
 {
rcu_segcblist_inc_len(rsclp);
-   smp_mb(); /* Ensure counts are updated before callback is enqueued. */
rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc6f379..b0fb654 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2523,7 +2523,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/* Update counts and requeue any remaining callbacks. */
rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
-   smp_mb(); /* List handling before counting for rcu_barrier(). */
rcu_segcblist_add_len(&rdp->cblist, -count);
 
/* Reinstate batch limit if we have worked down the excess. */


[tip: core/rcu] rcu/trace: Add tracing for how segcb list changes

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3afe7fa535491ecd0382c3968dc2349602bff8a2
Gitweb:
https://git.kernel.org/tip/3afe7fa535491ecd0382c3968dc2349602bff8a2
Author:Joel Fernandes (Google) 
AuthorDate:Sat, 14 Nov 2020 14:31:32 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/trace: Add tracing for how segcb list changes

This commit adds tracing to track how the segcb list changes before/after
acceleration, during queuing and during dequeuing.

This tracing helped discover an optimization that avoided needless GP
requests when no callbacks were accelerated. The tracing overhead is
minimal as each segment's length is now stored in the respective segment.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h | 26 ++
 kernel/rcu/tree.c  |  9 +
 2 files changed, 35 insertions(+)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 155b5cb..5fc2940 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -505,6 +505,32 @@ TRACE_EVENT_RCU(rcu_callback,
  __entry->qlen)
 );
 
+TRACE_EVENT_RCU(rcu_segcb_stats,
+
+   TP_PROTO(struct rcu_segcblist *rs, const char *ctx),
+
+   TP_ARGS(rs, ctx),
+
+   TP_STRUCT__entry(
+   __field(const char *, ctx)
+   __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS)
+   __array(long, seglen, RCU_CBLIST_NSEGS)
+   ),
+
+   TP_fast_assign(
+   __entry->ctx = ctx;
+   memcpy(__entry->seglen, rs->seglen, RCU_CBLIST_NSEGS * 
sizeof(long));
+   memcpy(__entry->gp_seq, rs->gp_seq, RCU_CBLIST_NSEGS * 
sizeof(unsigned long));
+
+   ),
+
+   TP_printk("%s seglen: (DONE=%ld, WAIT=%ld, NEXT_READY=%ld, 
NEXT=%ld) "
+ "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, 
NEXT=%lu)", __entry->ctx,
+ __entry->seglen[0], __entry->seglen[1], 
__entry->seglen[2], __entry->seglen[3],
+ __entry->gp_seq[0], __entry->gp_seq[1], 
__entry->gp_seq[2], __entry->gp_seq[3])
+
+);
+
 /*
  * Tracepoint for the registration of a single RCU callback of the special
  * kvfree() form.  The first argument is the RCU type, the second argument
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b0fb654..6bf269c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1495,6 +1495,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
if (!rcu_segcblist_pend_cbs(&rdp->cblist))
return false;
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPreAcc"));
+
/*
 * Callbacks are often registered with incomplete grace-period
 * information.  Something about the fact that getting exact
@@ -1515,6 +1517,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
else
trace_rcu_grace_period(rcu_state.name, gp_seq_req, 
TPS("AccReadyCB"));
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPostAcc"));
+
return ret;
 }
 
@@ -2471,11 +2475,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
if (offloaded)
rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
+
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbDequeued"));
rcu_nocb_unlock_irqrestore(rdp, flags);
 
/* Invoke callbacks. */
tick_dep_set_task(current, TICK_DEP_BIT_RCU);
rhp = rcu_cblist_dequeue(&rcl);
+
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
@@ -2987,6 +2994,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
trace_rcu_callback(rcu_state.name, head,
   rcu_segcblist_n_cbs(&rdp->cblist));
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
+
/* Go handle any RCU core processing required. */
if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Joel Fernandes
On Thu, Feb 04, 2021 at 03:52:53PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > +static void sched_core_update_cookie(struct task_struct *p, unsigned long 
> > cookie,
> > +enum sched_core_cookie_type cookie_type)
> > +{
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +
> > +   if (!p)
> > +   return;
> > +
> > +   rq = task_rq_lock(p, &rf);
> > +
> > +   switch (cookie_type) {
> > +   case sched_core_task_cookie_type:
> > +   p->core_task_cookie = cookie;
> > +   break;
> > +   case sched_core_group_cookie_type:
> > +   p->core_group_cookie = cookie;
> > +   break;
> > +   default:
> > +   WARN_ON_ONCE(1);
> > +   }
> > +
> > +   /* Set p->core_cookie, which is the overall cookie */
> > +   __sched_core_update_cookie(p);
> > +
> > +   if (sched_core_enqueued(p)) {
> > +   sched_core_dequeue(rq, p);
> > +   if (!p->core_cookie) {
> > +   task_rq_unlock(rq, p, &rf);
> > +   return;
> > +   }
> > +   }
> > +
> > +   if (sched_core_enabled(rq) &&
> > +   p->core_cookie && task_on_rq_queued(p))
> > +   sched_core_enqueue(task_rq(p), p);
> > +
> > +   /*
> > +* If task is currently running or waking, it may not be compatible
> > +* anymore after the cookie change, so enter the scheduler on its CPU
> > +* to schedule it away.
> > +*/
> > +   if (task_running(rq, p) || p->state == TASK_WAKING)
> > +   resched_curr(rq);
> 
> I'm not immediately seeing the need for that WAKING test. Since we're
> holding it's rq->lock, the only place that task can be WAKING is on the
> wake_list. And if it's there, it needs to acquire rq->lock to get
> enqueued, and rq->lock again to get scheduled.
> 
> What am I missing?

Hi Peter,

I did it this way following a similar pattern in affine_move_task(). However, I
think you are right. Unlike in the case affine_move_task(), we have
schedule() to do the right thing for us in case of any races with wakeup. So
the TASK_WAKING test is indeed not needed and we can drop that test. Apologies
for adding the extra test out of paranoia.

thanks,

 - Joel



Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Joel Fernandes
Hi Peter,

On Thu, Feb 04, 2021 at 02:59:58PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 03, 2021 at 05:51:15PM +0100, Peter Zijlstra wrote:
> > 
> > I'm slowly starting to go through this...
> > 
> > On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > > +static bool sched_core_empty(struct rq *rq)
> > > +{
> > > + return RB_EMPTY_ROOT(&rq->core_tree);
> > > +}
> > > +
> > > +static struct task_struct *sched_core_first(struct rq *rq)
> > > +{
> > > + struct task_struct *task;
> > > +
> > > + task = container_of(rb_first(&rq->core_tree), struct task_struct, 
> > > core_node);
> > > + return task;
> > > +}
> > 
> > AFAICT you can do with:
> > 
> > static struct task_struct *sched_core_any(struct rq *rq)
> > {
> > return rb_entry(rq->core_tree.rb_node, struct task_struct, code_node);
> > }
> > 
> > > +static void sched_core_flush(int cpu)
> > > +{
> > > + struct rq *rq = cpu_rq(cpu);
> > > + struct task_struct *task;
> > > +
> > > + while (!sched_core_empty(rq)) {
> > > + task = sched_core_first(rq);
> > > + rb_erase(&task->core_node, &rq->core_tree);
> > > + RB_CLEAR_NODE(&task->core_node);
> > > + }
> > > + rq->core->core_task_seq++;
> > > +}
> > 
> > However,
> > 
> > > + for_each_possible_cpu(cpu) {
> > > + struct rq *rq = cpu_rq(cpu);
> > > +
> > > + WARN_ON_ONCE(enabled == rq->core_enabled);
> > > +
> > > + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) 
> > > >= 2)) {
> > > + /*
> > > +  * All active and migrating tasks will have already
> > > +  * been removed from core queue when we clear the
> > > +  * cgroup tags. However, dying tasks could still be
> > > +  * left in core queue. Flush them here.
> > > +  */
> > > + if (!enabled)
> > > + sched_core_flush(cpu);
> > > +
> > > + rq->core_enabled = enabled;
> > > + }
> > > + }
> > 
> > I'm not sure I understand. Is the problem that we're still schedulable
> > during do_exit() after cgroup_exit() ?

Yes, exactly. Tim had written this code in the original patches and it
carried (I was not involved at that time). IIRC, the issue is the exit will
race with core scheduling being disabled. Even after core sched is disabled,
it will still exist in the core rb tree and needs to be removed. Otherwise it
causes crashes.

> It could be argued that when we
> > leave the cgroup there, we should definitely leave the tag group too.
> 
> That is, did you forget to implement cpu_cgroup_exit()?

Yes, I think it is better to implement it in cpu_cgroup_exit().
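
Roughly, I would expect it to be a thin wrapper around the cookie update
helper from this patch - just an untested sketch, and it would still need to
be wired up as the cpu controller's .exit callback:

    /*
     * Hypothetical cgroup exit hook: clear the group cookie of the exiting
     * task so that a dying task cannot linger in the core rb tree once the
     * cgroup tag no longer applies to it.
     */
    static void cpu_cgroup_exit(struct task_struct *task)
    {
            sched_core_update_cookie(task, 0UL, sched_core_group_cookie_type);
    }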

thanks,

 - Joel



Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Joel Fernandes
On Fri, Jan 29, 2021 at 06:27:27PM +0100, Vincent Guittot wrote:
[...]
> > update_blocked_averages with preempt and irq off is not a good thing
> > because we don't manage the number of csf_rq to update and I'm going
> > to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additonnal patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly 
> idle.
> 
> Signed-off-by: Vincent Guittot 

I confirmed that with this patch, I don't see the preemptoff issues related
to update_blocked_averages() anymore (tested using preemptoff tracer).

I went through the patch and it looks correct to me. I will review it further,
await further reviews from others as well, and then backport the patch to
our kernels. Thanks Vince and everyone!

Tested-by: Joel Fernandes (Google) 

thanks,

 - Joel



> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
>   struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
> + raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, &next_balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);
>  
>   rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
> 
> 
> > 
> > > for this.
> > >
> > > > Also update_blocked_aver

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-01 Thread Joel Fernandes
On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot
 wrote:
[...]
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see 
> > > > it
> > > > happen with other load balance as well, but it not that often as those 
> > > > LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not 
> > > > happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved.
> As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.

I disagree. Here you are assuming that schbench is the only task
running on the system. There are other processes and daemons as well. I
see a strong correlation between commenting out
update_blocked_averages() and not seeing the latency hit at the higher
percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay If
> you look at your results for schbench -t 2 -m 2 -r 5: The  *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473

Sure, there could be a different problem causing these higher
percentile latencies on your device. I still think 12ms is awful.

> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400

Yes, definitely changing the governor also solves the problem (for example,
if the performance governor is used). The problem happens at low frequencies.

> So a quick and wrong conclusion could be to say that you should disable EAS 
> ;-)

Sure, except the power management guys may come after me ;-)

[..]
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of csf_rq to update and 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-28 Thread Joel Fernandes
Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
 wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
> > > wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > >  wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I 
> > > > > > found
> > > > > > that there are a lot of calls to update_blocked_averages(). This 
> > > > > > causes
> > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest 
> > > > > > in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be 
> > > > partially
> > > > decayed and not have any tasks running on it but its load needs to be
> > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got ?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this 
> > loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least these many cfs_rq load updates.
>
> Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

All CPUs.

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle

Sure.

> > > >
> > > >   * There can be a lot of idle CPU cgroups.  Don't let 
> > > > fully
> > > >   * decayed cfs_rqs linger on the list.
> > > >   */
> > > >  if (cfs_rq_is_decayed(cfs_rq))
> > > >  list_del_leaf_cfs_rq(cfs_rq);
> > > >
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil 
> > > > > > in a
> > > > > > shared frequency domain configuration to be slower than usual. (the 
> > > > > > load
> > > > >
> > > > > What do you mean exactly by it causes schedutil to be slower than 
> > > > > usual ?
> > > >
> > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > frequency update needs to go through utilization of other CPUs in the
> > > > cluster. I believe this could be adding to the problem but is not really
> > > > needed to optimize if we can rate limit the calls to 
> > > > update_blocked_averages
> > > > to begin with.
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > this call ?
> >
> > I do not see a frequency update happening at the time of the problem. 
> > However
> > note that sugov_iowait_boost() does run even if frequency is not being
> > updated. IIRC, this function is also not that light weight and I am not sure
> > if it is a good idea to call this that often.
>
> Scheduler can't make any assumption about how often schedutil/cpufreq
> wants to be called. Some are fast and straightforward and can be
> called very often to adjust frequency; Others can't handle much
> updates. The rate limit mechanism in schedutil and io-boost should be
> there for such purpose.

Sure, I know that's the intention.

> > > > > > average updates also try to update the frequency

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-27 Thread Joel Fernandes
Hi Vincent,

On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  wrote:
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > >  wrote:
> > > >
> > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > times (due to newidle load balance). I have also seen this manifest in
> > > > the periodic balancer.
> > > >
> > > > Closer look shows that the problem is caused by the following
> > > > ingredients:
> > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > __update_blocked_fair() take a long time.
> > >
> > > Inactive cgroups are removed from the list so they should not impact
> > > the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> 
> How many blocked cgroups have you got ?

I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
this loop runs per new idle balance. When the problem happens I see this loop
run 35-40 times (for one single instance of newidle balance). So in total
there are at least these many cfs_rq load updates.
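
(For reference, the counter was nothing more than a throwaway debug hack in
__update_blocked_fair(), roughly like the fragment below; the local variable
and the trace_printk() are made up for illustration and were never meant for
merging:)

	int nr_updated = 0;	/* debug only */

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		nr_updated++;	/* one blocked-load update per cfs_rq */
		/* ... existing decay/update of cfs_rq->avg ... */
	}
	trace_printk("newidle: updated blocked load for %d cfs_rqs\n", nr_updated);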

I also see that new idle balance can be called 200-500 times per second.

> >
> >   * There can be a lot of idle CPU cgroups.  Don't let fully
> >   * decayed cfs_rqs linger on the list.
> >   */
> >  if (cfs_rq_is_decayed(cfs_rq))
> >  list_del_leaf_cfs_rq(cfs_rq);
> >
> > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > shared frequency domain configuration to be slower than usual. (the load
> > >
> > > What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> 
> Qais mentioned half of the time being used by
> sugov_next_freq_shared(). Are there any frequency changes resulting in
> this call ?

I do not see a frequency update happening at the time of the problem. However
note that sugov_iowait_boost() does run even if frequency is not being
updated. IIRC, this function is also not that light weight and I am not sure
if it is a good idea to call this that often.

> > > > average updates also try to update the frequency in schedutil).
> > > >
> > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > code paths to take longer than when running at a high CPU frequency.
> > >
> > > Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to 
> > reproduce.
> 
> Happening a lot in itself is not a problem if there is nothing else to
> do so it's not a argument in itself

It is a problem - it shows up in the preempt-off critical section latency
tracer. Are you saying it's OK for preemption to be disabled on the system for
500 microseconds?  That hurts real-time applications (audio etc.).

> So why is it a problem for you ? You are mentioning newly idle load
> balance so I assume that your root problem is the scheduling delay
> generated by the newly idle load balance which then calls
> update_blocked_averages

Yes, the new idle balance is where I see it happen most often. I do see it
happen with the other load balance paths as well, but not as often, since
those don't run as frequently as the new idle balance.

> 
> rate limiting the call to update_blocked_averages() will only reduce
> the number of time it happens but it will not prevent it to happen.

Sure, but soft real-time applications can tolerate the issue if it does not
happen very often. In this case though, it is frequent.

> IIUC, your real problem is that newidle_balance is

[PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-01-22 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also after we forked a task, its core scheduler queue's presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistencies between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now that avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Co-developed-by: Josh Don 
Co-developed-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  10 +
 include/uapi/linux/prctl.h   |   6 +
 kernel/fork.c|   1 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  | 136 ++-
 kernel/sched/coretag.c   | 669 +++
 kernel/sched/debug.c |   4 +
 kernel/sched/sched.h |  58 ++-
 kernel/sys.c |   7 +
 tools/include/uapi/linux/prctl.h |   6 +
 10 files changed, 878 insertions(+), 20 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..7ca6f2f72cda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2076,4 +2078,12 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..f8e4e9626121 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+# define PR_SCHED_CORE_CLEAR   0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM  1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO    2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20125431af87..a3844e2e7379 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -136,7 +136,33 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+st

[PATCH v10 3/5] kselftest: Add tests for core-sched interface

2021-01-22 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 716 ++
 4 files changed, 732 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..4d18a0a727c8
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,716 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR    0
+# define PR_SCHED_CORE_SHARE_FROM   1
+# define PR_SCHED_CORE_SHARE_TO 2
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+   printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+   printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+   if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+   }
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+   sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+   char path[50] = {}, *val;
+   int fd;
+
+   sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+   fd = open(path, O_RDONLY, 0666);
+   if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   val = calloc(1, 50);
+   if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+   }
+
+   val[strcspn(val, "\r\n")] = 0;
+
+   close(fd);
+   return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+   char tag_path[50] = {}, rdbuf[8] = {};
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_RDONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match

[PATCH v10 1/5] sched: migration changes for core scheduling

2021-01-22 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie does not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 72 
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f90765f7fd..fddd7c44bbf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3efcbc779a75..d6efb1ffc08c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1122,6 +1122,7 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled)

[PATCH v10 5/5] sched: Debug bits...

2021-01-22 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 35 ++-
 kernel/sched/fair.c |  9 +
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3844e2e7379..56ba2ca4f922 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -296,12 +300,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5237,6 +5245,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5331,6 +5346,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5347,6 +5365,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5368,6 +5388,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5409,13 +5430,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5455,6 +5484,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fddd7c44bbf3..ebeeebc4223a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10769,6 +10769,15 @@ static void se_fi_update(struct sched_entity *se, 
unsigned int fi_seq, bool forc
 

[PATCH v10 4/5] Documentation: Add core scheduling documentation

2021-01-22 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 263 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 264 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..a795747c706a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,263 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks don't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improvement performance, however it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real world
+workloads. In theory, core scheduling aims to perform at least as good as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that can be co-scheduled
+on the same core.
+The core scheduler uses this information to make sure that tasks that are not
+in the same group never run simultaneously on a core, while doing its best to
+satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all of the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set this
+  for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this reason,
+  if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for the
+creation of core scheduling groups and for the admission and removal of tasks
+from such groups.
+
+::
+
+#include <sys/prctl.h>
+
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+unsigned long arg4, unsigned long arg5);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+- ``PR_SCHED_CORE_CLEAR0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_SHARE_FROM   1  -- get core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO 2  -- push core_sched cookie to pid``
+
+arg3:
+``tid`` of the task for which the operation applies
+
+arg4 and arg5:
+MUST be equal to 0.
+
+Creation
+~~~~~~~~
+Creation is accomplished by sharing a ``cookie`` from a process not currently in
+a core scheduling group.
+
+::
+
+if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_tid, 0, 0) < 0)
+handle_error("src_tid sched_core failed");
+
+Removal
+~~~~~~~
+Removing a task from a core scheduling group is done by:
+
+::
+
+if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, clr_tid, 0, 0) < 0)
+ handle_error("clr_tid sched_core failed");
+
+Cookie Transferal
+~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
+specified task or
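
(For illustration only -- this is not part of the patch above: a self-contained
userspace sketch of the interface as documented, using the PR_SCHED_CORE_*
values from this series. The fork()/sleep() scaffolding exists only so there is
a second tid to operate on, and error handling is minimal.)

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE		59
# define PR_SCHED_CORE_CLEAR		0
# define PR_SCHED_CORE_SHARE_FROM	1
# define PR_SCHED_CORE_SHARE_TO		2
#endif

int main(void)
{
	pid_t child = fork();

	if (child < 0) {
		perror("fork");
		return 1;
	}
	if (child == 0) {	/* child: idle around so the parent can tag it */
		sleep(5);
		return 0;
	}

	/* Creation (see above): share a cookie from a task not yet in a group. */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, child, 0, 0) < 0)
		perror("PR_SCHED_CORE_SHARE_FROM");

	/* ... parent and child may now share a core ... */

	/* Removal (see above): clear our own cookie again (pid == tid here). */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, getpid(), 0, 0) < 0)
		perror("PR_SCHED_CORE_CLEAR");

	waitpid(child, NULL, 0);
	return 0;
}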

[PATCH v10 0/5] Core scheduling remaining patches

2021-01-22 Thread Joel Fernandes (Google)
 weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Peter Zijlstra (1):
sched: CGroup tagging interface for core scheduling

.../admin-guide/hw-vuln/core-scheduling.rst   | 263 +++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
include/linux/sched.h |  10 +
include/uapi/linux/prctl.h|   6 +
kernel/fork.c |   1 +
kernel/sched/Makefile |   1 +
kernel/sched/core.c   | 171 -
kernel/sched/coretag.c| 669 
kernel/sched/debug.c  |   4 +
kernel/sched/fair.c   |  42 +-
kernel/sched/sched.h  | 130 +++-
kernel/sys.c  |   7 +
tools/include/uapi/linux/prctl.h  |   6 +
tools/testing/selftests/sched/.gitignore  |   1 +
tools/testing/selftests/sched/Makefile|  14 +
tools/testing/selftests/sched/config  |   1 +
.../testing/selftests/sched/test_coresched.c  | 716 ++
17 files changed, 2018 insertions(+), 25 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.30.0.280.ga3ce27912f-goog



Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes
On Fri, Jan 22, 2021 at 06:39:27PM +, Qais Yousef wrote:
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > > force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > -   if (!force && !time_after(jiffies, 
> > > rq->last_blocked_load_update_tick))
> > > +   if (!force && !time_after(jiffies, 
> > > rq->last_blocked_load_update_tick + (HZ/20)))
> > 
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
> 
> This was meant as an RFC patch to discuss the problem really.

Agreed, sorry.

> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in 
> sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in
> return. Correct me if my numbers are wrong Joel.

Correct, and I see update_nohz_stats() itself called around 8 times during a
load balance which multiplies the overhead.

Dietmar also found that the reason for update_nohz_stats() being called
8 times is that in our setup there is only 1 MC sched domain with all 8
CPUs, versus say 2 MC domains with 4 CPUs each.

> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.

Agreed.

> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.

True!
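
To make the "adds up quickly" part concrete, the path I am seeing on this
v5.4-based kernel is roughly the following (simplified; reconstructed from the
function_graph traces, so take the exact call sites with a grain of salt):

	/*
	 * schedule()
	 *   pick_next_task_fair()
	 *     newidle_balance()                    <- preemption/IRQs off
	 *       load_balance()
	 *         find_busiest_group()
	 *           update_sd_lb_stats()
	 *             update_sg_lb_stats()          <- per sched group
	 *               update_nohz_stats()         <- per idle CPU in the group
	 *                 update_blocked_averages()
	 *                   __update_blocked_fair() <- walks every blocked cfs_rq
	 *                   ... -> sugov_update_shared()
	 *
	 * So a single newidle balance can decay 35-40 cfs_rqs and poke schedutil
	 * several times, all in one long preempt-off section.
	 */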

> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it 
> seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.

Correct.

thanks,

 - Joel

[..]


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes
Hi Vincent,

Thanks for reply. Please see the replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>  wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking upto 500 micro seconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
> 
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it but its load needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

  * There can be a lot of idle CPU cgroups.  Don't let fully
  * decayed cfs_rqs linger on the list.
  */
 if (cfs_rq_is_decayed(cfs_rq))
 list_del_leaf_cfs_rq(cfs_rq);

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
> 
> What do you mean exactly by it causes schedutil to be slower than usual ?

sugov_next_freq_shared() is on the order of the number of CPUs in a cluster.
This system is a 6+2 system with 6 CPUs in a cluster. A schedutil shared-policy
frequency update needs to go through the utilization of the other CPUs in the
cluster. I believe this could be adding to the problem, but it is not really
worth optimizing if we can rate limit the calls to update_blocked_averages()
to begin with.

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
> 
> Low frequency usually means low utilization so it should happen that much.

It happens a lot as can be seen with schbench. It is super easy to reproduce.

schedule() can result in a new idle balance via the CFS pick path, and this
happens often. Here is a function graph trace; it shows
update_blocked_averages() taking a lot of time.

 sugov:0-2454  [002]  2657.992570: funcgraph_entry:              |  load_balance() {
 sugov:0-2454  [002]  2657.992577: funcgraph_entry:              |    update_group_capacity() {
 sugov:0-2454  [002]  2657.992580: funcgraph_entry:   2.656 us   |      __msecs_to_jiffies();
 sugov:0-2454  [002]  2657.992585: funcgraph_entry:   2.447 us   |      _raw_spin_lock_irqsave();
 sugov:0-2454  [002]  2657.992591: funcgraph_entry:   2.552 us   |      _raw_spin_unlock_irqrestore();
 sugov:0-2454  [002]  2657.992595: funcgraph_exit:  + 17.448 us  |    }
 sugov:0-2454  [002]  2657.992597: funcgraph_entry:   1.875 us   |    update_nohz_stats();
 sugov:0-2454  [002]  2657.992601: funcgraph_entry:   1.667 us   |    idle_cpu();
 sugov:0-2454  [002]  2657.992605: funcgraph_entry:              |    update_nohz_stats() {
 sugov:0-2454  [002]  2657.992608: funcgraph_entry: + 33.333 us  |      update_blocked_averages();
 sugov:0-2454  [002]  2657.992643: funcgraph_exit:  + 38.073 us  |    }
 sugov:0-2454  [002]  2657.992645: funcgraph_entry:   1.770 us   |    idle_cpu();
 sugov:0-2454  [002]  2657.992649: funcgraph_entry:              |    update_nohz_stats() {
 sugov:0-2454  [002]  2657.992651: funcgraph_entry: + 41.823 us  |      update_blocked_averages();
 sugov:0-2454  [002]  2657.992694: funcgraph_exit:  + 45.729 us  |    }
 sugov:0-2454  [002]  2657.992696: funcgraph_entry:   1.823 us   |    idle_cpu();
 sugov:0-2454  [002]  2657.992700: funcgraph_entry:              |    update_nohz_stats() {
 sugov:0-2454  [002]  2657.992702: funcgraph_entry: + 35.312 us  |      update_blocked_averages();
 sugov:0-2454  [002]  2657.992740: funcgraph_exit:  + 39.792 us  |    }
 sugov:0-2454  [002]  2657.992742: funcgraph_entry:   1.771 us   |    idle_cpu();
 sugov:0-2454  [002]  2657.992746: funcgraph_entry:              |    update_nohz_stats() {
 sugov:0-2454  [002]  2657.992748: funcgraph_entry: + 33.438 us  |      update_blocked_averages();
 sugov:0-2454  [002]  2657.992783: funcgraph_exit:  + 37.500 us  |    }
 sugov:0-2454  [002]  2657.992785: funcgraph_entry:   1.771 us

[PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes (Google)
On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down to taking up to 500 microseconds at
times (due to newidle load balance). I have also seen this manifest in
the periodic balancer.

Closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is to simply rate limit the calls to update_blocked_averages() to 20
times per second. It appears that updating the blocked average less
often is sufficient. Currently I see about 200 calls per second
sometimes, which seems overkill.

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
50.0th: 210 (106 samples)
75.0th: 619 (53 samples)
90.0th: 665 (32 samples)
95.0th: 703 (11 samples)
*99.0th: 12656 (8 samples)
99.5th: 12784 (1 samples)
99.9th: 13424 (1 samples)
min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
50.0th: 188 (108 samples)
75.0th: 238 (53 samples)
90.0th: 623 (32 samples)
95.0th: 657 (12 samples)
*99.0th: 717 (7 samples)
99.5th: 725 (2 samples)
99.9th: 725 (0 samples)

Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Suggested-by: Dietmar Eggeman 
Co-developed-by: Qais Yousef 
Signed-off-by: Qais Yousef 
Signed-off-by: Joel Fernandes (Google) 

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;
 
-   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
return true;
 
update_blocked_averages(cpu);
-- 
2.30.0.280.ga3ce27912f-goog



Re: [PATCH -tip 32/32] sched: Debug bits...

2021-01-15 Thread Joel Fernandes
On Tue, Dec 01, 2020 at 11:21:37AM +1100, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:20:02PM -0500, Joel Fernandes (Google) wrote:
> > Tested-by: Julien Desfossez 
> > Not-Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> 
> May be put it under a #ifdef CONFIG_SCHED_CORE_DEBUG, even then please
> make it more driven by selection via tracing rather than just trace_printk()

This particular patch is only for debug and is not for merging.

Peter is preparing a tree with some patches already applied; once that's done,
we will send a new series with the remaining patches (mostly interface and
docs left).

thanks,

 - Joel




Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

2020-12-15 Thread Joel Fernandes
Hi Marc, Quentin,

On Fri, Dec 11, 2020 at 4:34 AM Quentin Perret  wrote:
>
> On Thursday 10 Dec 2020 at 08:45:22 (+), Marc Zyngier wrote:
> > On 2020-12-10 01:39, Joel Fernandes wrote:
> >
> > [...]
> >
> > > > Quentin and I have discussed potential ways of improving guest
> > > > scheduling
> > > > on terminally broken systems (otherwise known as big-little), in the
> > > > form of a capacity request from the guest to the host. I'm not really
> > > > keen on the host exposing its own capacity, as that doesn't tell the
> > > > host what the guest actually needs.
> > >
> > > I am not sure how a capacity request could work well. It seems the
> > > cost of a repeated hypercall could be prohibitive. In this case, a
> > > lighter approach might be for KVM to restrict vCPU threads to run on
> > > certain types of cores, and pass the capacity information to the guest
> > > at guest's boot time.
> >
> > That seems like a very narrow use case. If you actually pin vcpus to
> > physical CPU classes, DT is the right place to put things, because
> > it is completely static. This is effectively creating a virtual
> > big-little, which is in my opinion a userspace job.
>
> +1, all you should need for this is to have the VMM pin the vCPUS and
> set capacity-dmips-mhz in the guest DT accordingly. And if you're
> worried about sharing the runqueue with host tasks, could you vacate the
> host CPUs using cpusets or such?

I agree, the VMM is the right place for it with appropriate DT
settings. I think this is similar to how CPUID is emulated on Intel as
well (for example to specify SMT topology for a vCPU) -- it is done by
the VMM.

On sharing vCPUs with host tasks, that is indeed an issue because the
host does not know the priority of an app (for example, a "top app"
running in Android in a VM). The sharing with host tasks should be OK
as long as the scheduler priorities of the vCPU threads on the host
are set up correctly?

> The last difficult bit is how to drive DVFS. I suppose Marc's suggestion
> to relay capacity requests from the guest would help with that.

Yeah I misunderstood Marc.  I think for DVFS, a hypercall for capacity
request should work and be infrequent enough. IIRC, there is some rate
limiting support in cpufreq governors as well that should reduce the
rate of hypercalls if needed.
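
Just to illustrate the kind of guest-side rate limiting I have in mind -- a
rough sketch only; kvm_request_capacity() is a made-up name for whatever
hypercall wrapper would eventually be defined, and the 10ms window is an
arbitrary example:

	static u64 last_request_ns;

	static void maybe_request_capacity(u64 now_ns, unsigned long min_cap)
	{
		/* Similar in spirit to the cpufreq governors' rate limits. */
		if (now_ns - last_request_ns < 10 * NSEC_PER_MSEC)
			return;

		last_request_ns = now_ns;
		kvm_request_capacity(min_cap);	/* hypothetical hypercall wrapper */
	}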

> > > This would be a one-time cost to pay. And then,
> > > then the guest scheduler can handle the scheduling appropriately
> > > without any more hypercalls. Thoughts?
> >
> > Anything that is a one-off belongs to firmware configuration, IMO.
> >
> > The case I'm concerned with is when vcpus are allowed to roam across
> > the system, and hit random physical CPUs because the host has no idea
> > of the workload the guest deals with (specially as the AMU counters
> > are either absent or unusable on any available core).

It sounds like this might be a usecase for pinning the vCPU threads
appropriately (so designate a set of vCPU threads to only run on bigs
and another set to only run on LITTLEs).  The host can set up the DT to
describe this, and the VM kernel's scheduler can do appropriate task
placement.  Did I miss anything?
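
In VMM terms the pinning part is just per-vCPU-thread affinity, something like
the sketch below (the CPU numbers are only an example for a hypothetical system
whose big cores are CPUs 4-7; the VMM would use whatever thread handles it
already has):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>

	static int pin_vcpu_to_bigs(pthread_t vcpu_thread)
	{
		cpu_set_t set;
		int cpu;

		CPU_ZERO(&set);
		for (cpu = 4; cpu <= 7; cpu++)	/* example: bigs are CPUs 4-7 */
			CPU_SET(cpu, &set);

		return pthread_setaffinity_np(vcpu_thread, sizeof(set), &set);
	}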

> > The cost of a hypercall really depends on where you terminate it.
> > If it is a shallow exit, that's only a few hundred cycles on any half
> > baked CPU. Go all the way to userspace, and the host scheduler is the
> > limit. But the frequency of that hypercall obviously matters too.
> >
> > How often do you expect the capacity request to fire? Probably not
> > on each and every time slice, right?
> >
> > Quentin, can you shed some light on this?
>
> Assuming that we change the 'capacity request' (aka uclamp.min of the
> vCPU) every time the guest makes a frequency request, then the answer
> very much is 'it depends on the workload'. Yes there is an overhead, but
> I think it is hard to say how bad that would be before we give it a go.
> It's unfortunately not uncommon to have painfully slow frequency changes
> on real hardware, so this may be just fine. And there may be ways we
> can mitigate this too (with rate limiting and such), so all in all it is
> worth a try.

Agreed.

> Also as per the above, this still would help even if the VMM pins vCPUs
> and such, so these two things can live and complement each other I
> think.

Makes sense.

> Now, for the patch originally under discussion here, no objection from
> me in principle, it looks like a nice improvement to the stolen time
> stuff and I can see how that could help some use-cases, so +1 from me.

Sounds good!

thanks,

 - Joel


Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-15 Thread Joel Fernandes
Hello Dhaval,

On Tue, Dec 15, 2020 at 1:14 PM Dhaval Giani  wrote:
>
> On 12/14/20 3:25 PM, Joel Fernandes wrote:
>
> >> No problem. That was there primarily for debugging.
> > Ok. I squashed Josh's changes into this patch and several of my fixups. So
> > there'll be 3 patches:
> > 1. CGroup + prctl  (single patch as it is hell to split it)
>
> Please don't do that.
> I am not sure we have thought the cgroup interface through
> (looking at all the discussions).

Unfortunately, this comment does not provide any information on the
issues you are concerned about.  If there are specific issues, please
don't keep them a secret! What requirement of yours is not being met
with the CGroup interface the way it is in the v8 series?

thanks,

 - Joel


Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-14 Thread Joel Fernandes
On Mon, Dec 14, 2020 at 02:44:09PM -0500, chris hyser wrote:
> On 12/14/20 2:31 PM, Joel Fernandes wrote:
> > > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > > index cffdfab..50c31f3 100644
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, 
> > > struct pid_namespace *ns,
> > >   #ifdef CONFIG_SCHED_CORE
> > >   __PS("core_cookie", p->core_cookie);
> > > + __PS("core_task_cookie", p->core_task_cookie);
> > >   #endif
> > 
> > Hmm, so the final cookie of the task is always p->core_cookie. This is what
> > the scheduler uses. All other fields are ingredients to derive the final
> > cookie value.
> > 
> > I will drop this hunk from your overall diff, but let me know if you
> > disagree!
> 
> 
> No problem. That was there primarily for debugging.

Ok. I squashed Josh's changes into this patch and several of my fixups. So
there'll be 3 patches:
1. CGroup + prctl  (single patch as it is hell to split it)
2. Documentation
3. kselftests

Below is the diff of #1. I still have to squash in the stop_machine removal
and some more review changes. But other than that, please take a look and let
me know anything that's odd.  I will test further as well.

Also, the next series will only be the interface, as I want to see if I can
get lucky enough to have Peter look at it before he leaves for PTO next week.
For the other features, I will post separate series as I prepare them: one
series for the interface, and another for kernel protection / migration changes.

---8<---

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a60868165590..73baca11d743 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2081,11 +2083,15 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..3752006842e1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+#define PR_SCHED_CORE_CLEAR0  /* clear core_sched cookie of pid */
+#define PR_SCHED_CORE_SHARE_FROM   1  /* get core_sched cookie from pid */
+#define PR_SCHED_CORE_SHARE_TO 2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..80daca9c5930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,7 +157,33 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&

Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-14 Thread Joel Fernandes
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index cffdfab..50c31f3 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct 
> pid_namespace *ns,
>  
>  #ifdef CONFIG_SCHED_CORE
>   __PS("core_cookie", p->core_cookie);
> + __PS("core_task_cookie", p->core_task_cookie);
>  #endif

Hmm, so the final cookie of the task is always p->core_cookie. This is what
the scheduler uses. All other fields are ingredients to derive the final
cookie value.
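
That is, conceptually (the actual combination logic lives in coretag.c):

	/*
	 * p->core_task_cookie  - ingredient, set via the prctl() interface
	 * p->core_group_cookie - ingredient, set via the cgroup cpu.core_tag file
	 * p->core_cookie       - what the scheduler actually compares when
	 *                        deciding which tasks may share a core; derived
	 *                        from the two ingredients above.
	 */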

I will drop this hunk from your overall diff, but let me know if you
disagree!

thanks,

 - Joel



Re: Energy-efficiency options within RCU

2020-12-14 Thread Joel Fernandes
On Thu, Dec 10, 2020 at 10:37:37AM -0800, Paul E. McKenney wrote:
> Hello, Joel,
> 
> In case you are -seriously- interested...  ;-)

I am always seriously interested :-). The issue is when life throws me a
curveball. This was the year of curveballs :-)

Thank you for your reply. I have added it to my list to investigate how we
are configuring nocb on our systems. I don't think anyone over here has given
these RCU issues a serious look yet.

thanks,

 - Joel



>   Thanx, Paul
> 
> rcu_nocbs=
> 
>   Adding a CPU to this list offloads RCU callback invocation from
>   that CPU's softirq handler to a kthread.  In big.LITTLE systems,
>   this kthread can be placed on a LITTLE CPU, which has been
>   demonstrated to save significant energy in benchmarks.
>   
> http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf
> 
> nohz_full=
> 
>   Any CPU specified by this boot parameter is handled as if it was
>   specified by rcu_nocbs=.
> 
> rcutree.jiffies_till_first_fqs=
> 
>   Increasing this will decrease wakeup frequency to the grace-period
>   kthread for the first FQS scan.  And increase grace-period
>   latency.
> 
> rcutree.jiffies_till_next_fqs=
> 
>   Ditto, but for the second and subsequent FQS scans.
> 
>   My guess is that neither of these makes much difference.  But if
>   they do, maybe some sort of backoff scheme for FQS scans?
> 
> rcutree.jiffies_till_sched_qs=
> 
>   Increasing this will delay RCU's getting excited about CPUs and
>   tasks not responding with quiescent states.  This excitement
>   can cause extra overhead.
> 
>   No idea whether adjusting this would help.  But if you increase
>   rcutree.jiffies_till_first_fqs or rcutree.jiffies_till_next_fqs,
>   you might need to increase this one accordingly.
> 
> rcutree.qovld=
> 
>   Increasing this will increase the grace-period duration at which
>   RCU starts sending IPIs, thus perhaps reducing the total number
>   of IPIs that RCU sends.  The destination CPUs are unlikely to be
>   idle, so it is not clear to me that this would help much.  But
>   perhaps I am wrong about them being mostly non-idle, who knows?
> 
> rcupdate.rcu_cpu_stall_timeout=
> 
>   If you get overly zealous about the earlier kernel boot parameters,
>   you might need to increase this one as well.  Or instead use the
>   rcupdate.rcu_cpu_stall_suppress= kernel boot parameter to suppress
>   RCU CPU stall warnings entirely.
> 
> rcutree.rcu_nocb_gp_stride=
> 
>   Increasing this might reduce grace-period work somewhat.  I don't
>   see why a (say) 16-CPU system really needs to have more than one
>   rcuog kthread, so if this does help it might be worthwhile setting
>   a lower limit to this kernel parameter.
> 
> rcutree.rcu_idle_gp_delay=  (Only CONFIG_RCU_FAST_NO_HZ=y kernels.)
> 
>   This defaults to four jiffies on the theory that grace periods
>   tend to last about that long.  If grace periods tend to take
>   longer, then it makes a lot of sense to increase this.  And maybe
>   battery-powered devices would rather have it be about 2x or 3x
>   the expected grace-period duration, who knows?
> 
>   I would keep it to a power of two, but the code should work with
>   other numbers.  Except that I don't know that this has ever been
>   tested.  ;-)
> 
> srcutree.exp_holdoff=
> 
>   Increasing this decreases the number of SRCU grace periods that
>   are treated as expedited.  But you have to have closely-spaced
>   SRCU grace periods for this to matter.  (These do happen at least
>   sometimes because I added this only because someone complained
>   about the performance regression from the earlier non-tree SRCU.)
> 
> rcupdate.rcu_task_ipi_delay=
> 
>   This kernel parameter delays sending IPIs for RCU Tasks Trace,
>   which is used by sleepable BPF programs.  Increasing it can
>   reduce overhead, but can also increase the latency of removing
>   sleepable BPF programs.
> 
> rcupdate.rcu_task_stall_timeout=
> 
>   If you slow down RCU Tasks Trace too much, you may need this.
>   But then again, the default 10-minute value should suffice.
> 
> CONFIG_RCU_FAST_NO_HZ=y
> 
>   This only has effect on CPUs not specified by rcu_nocbs, and thus
>   might be useful on systems that offload RCU callbacks only on
>   some of the CPUs.  For example, a big.LITTLE system might offload
>   only the big CPUs.  This Kconfig option reduces the frequency of
>   timer interrupts (and thus of RCU-related softirq processing)
>   on idle CPUs.  This has been shown to save significant energy
>   in benchmarks:
>   
> http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf
> 
> CONFIG_RCU_STRICT_GRACE_PERIOD=y
> 
>   This works hard (as in burns CPU) to 

Re: [PATCH v12 00/31] Speculative page faults

2020-12-14 Thread Joel Fernandes
On Mon, Dec 14, 2020 at 10:36:29AM +0100, Laurent Dufour wrote:
> Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> > On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> > [..]
> > > > > Hi Laurent,
> > > > > 
> > > > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > > > several experiments, we observed SPF has obvious improvements on the
> > > > > launch time of applications, especially for those high-TLP ones,
> > > > > 
> > > > > # launch time of applications(s):
> > > > > 
> > > > > package          version    w/ SPF   w/o SPF  improve(%)
> > > > > ------------------------------------------------------------
> > > > > Baidu maps       10.13.3    0.887    0.98      9.49
> > > > > Taobao           8.4.0.35   1.227    1.293     5.10
> > > > > Meituan          9.12.401   1.107    1.543    28.26
> > > > > WeChat           7.0.3      2.353    2.68     12.20
> > > > > Honor of Kings   1.43.1.6   6.63     6.713     1.24
> > > > 
> > > > That's great news, thanks for reporting this!
> > > > 
> > > > > 
> > > > > By the way, we have verified our platforms with those patches and
> > > > > achieved the goal of mass production.
> > > > 
> > > > Another good news!
> > > > For my information, what is your targeted hardware?
> > > > 
> > > > Cheers,
> > > > Laurent.
> > > 
> > > Hi Laurent,
> > > 
> > > Our targeted hardware belongs to ARM64 multi-core series.
> > 
> > Hello!
> > 
> > I was trying to develop an intuition about why SPF gives an improvement for
> > you on small CPU systems. This is just a high-level theory but:
> > 
> > 1. Assume the improvement is because of elimination of "blocking" on
> > mmap_sem.
> > Could it be that the mmap_sem is acquired in write-mode unnecessarily in 
> > some
> > places, thus causing blocking on mmap_sem in other paths? If so, is it
> > feasible to convert such usages to acquiring them in read-mode?
> 
> That's correct, and the goal of this series is to try not holding the
> mmap_sem in read mode during page fault processing.
> 
> Converting mmap_sem holder from write to read mode is not so easy and that
> work as already been done in some places. If you think there are areas where
> this could be done, you're welcome to send patches fixing that.
> 
> > 2. Assume the improvement is because of lesser read-side contention on
> > mmap_sem.
> > On small CPU systems, I would not expect reducing cache-line bouncing to 
> > give
> > such a dramatic improvement in performance as you are seeing.
> 
> I don't think cache line bouncing reduction is the main source of
> performance improvement, I would rather think this is the lower part here.
> I guess this is mainly because during loading time a lot of page faults are
> occurring and thus SPF is reducing the contention on the mmap_sem.

Thanks for the reply. I think I also wrongly assumed that acquiring the mmap
rwsem in write mode in a syscall makes SPF moot. Peter explained to me on IRC
that there's still a perf improvement in write mode if an unrelated VMA is
modified while another VMA is faulting.  CMIIW - not an mm expert by any
stretch.

Thanks!

 - Joel



Re: [PATCH v12 00/31] Speculative page faults

2020-12-13 Thread Joel Fernandes
On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]
> > > Hi Laurent,
> > > 
> > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > several experiments, we observed SPF has obvious improvements on the
> > > launch time of applications, especially for those high-TLP ones,
> > > 
> > > # launch time of applications(s):
> > > 
> > > package          version    w/ SPF   w/o SPF  improve(%)
> > > ------------------------------------------------------------
> > > Baidu maps       10.13.3    0.887    0.98      9.49
> > > Taobao           8.4.0.35   1.227    1.293     5.10
> > > Meituan          9.12.401   1.107    1.543    28.26
> > > WeChat           7.0.3      2.353    2.68     12.20
> > > Honor of Kings   1.43.1.6   6.63     6.713     1.24
> > 
> > That's great news, thanks for reporting this!
> > 
> > > 
> > > By the way, we have verified our platforms with those patches and
> > > achieved the goal of mass production.
> > 
> > Another good news!
> > For my information, what is your targeted hardware?
> > 
> > Cheers,
> > Laurent.
> 
> Hi Laurent,
> 
> Our targeted hardware belongs to ARM64 multi-core series.

Hello!

I was trying to develop an intuition about why SPF gives an improvement for
you on small CPU systems. This is just a high-level theory but:

1. Assume the improvement is because of elimination of "blocking" on
mmap_sem.
Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
places, thus causing blocking on mmap_sem in other paths? If so, is it
feasible to convert such usages to acquiring them in read-mode?

2. Assume the improvement is because of lesser read-side contention on
mmap_sem.
On small CPU systems, I would not expect reducing cache-line bouncing to give
such a dramatic improvement in performance as you are seeing.

Thanks for any insight on this!

- Joel



[tip: core/rcu] rcu/tree: Add a warning if CPU being onlined did not report QS already

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 9f866dac94292f93d3b6bf8dbe860a44b954e555
Gitweb:
https://git.kernel.org/tip/9f866dac94292f93d3b6bf8dbe860a44b954e555
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 29 Sep 2020 15:29:27 -04:00
Committer: Paul E. McKenney 
CommitterDate: Thu, 19 Nov 2020 19:37:16 -08:00

rcu/tree: Add a warning if CPU being onlined did not report QS already

Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU.  However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed.  Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition.  Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.

This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code.  This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.

Cc: Neeraj Upadhyay 
Suggested-by: Paul E. McKenney 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 39e14cf..e4d6d0b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4075,7 +4075,9 @@ void rcu_cpu_starting(unsigned int cpu)
rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */
rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
-   if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+
+   /* An incoming CPU should never be blocking a grace period. */
+   if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */
rcu_disable_urgency_upon_qs(rdp);
/* Report QS -after- changing ->qsmaskinitnext! */
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);


[tip: core/rcu] rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: bd56e0a4a291bc9db2cbaddef20ec61a1aad4208
Gitweb:
https://git.kernel.org/tip/bd56e0a4a291bc9db2cbaddef20ec61a1aad4208
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 07 Oct 2020 13:50:36 -07:00
Committer: Paul E. McKenney 
CommitterDate: Thu, 19 Nov 2020 19:37:17 -08:00

rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute.  This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().

One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.

This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.

This change passed 30 minute tests of TREE01 through TREE09 each.

On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread.  This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.

Cc: Neeraj Upadhyay 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d6a015e..50d90ee 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3718,7 +3718,8 @@ static int rcu_pending(int user)
return 1;
 
/* Does this CPU have callbacks ready to invoke? */
-   if (rcu_segcblist_ready_cbs(&rdp->cblist))
+   if (!rcu_segcblist_is_offloaded(&rdp->cblist) &&
+       rcu_segcblist_ready_cbs(&rdp->cblist))
return 1;
 
/* Has RCU gone idle with this CPU needing another grace period? */


[tip: core/rcu] docs: Update RCU's hotplug requirements with a bit about design

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: a043260740d5d6ec5be59c3fb595c719890a0b0b
Gitweb:
https://git.kernel.org/tip/a043260740d5d6ec5be59c3fb595c719890a0b0b
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 29 Sep 2020 15:29:28 -04:00
Committer: Paul E. McKenney 
CommitterDate: Fri, 06 Nov 2020 17:02:43 -08:00

docs: Update RCU's hotplug requirements with a bit about design

The rcu_barrier() section of the "Hotplug CPU" section discusses
deadlocks, however the description of deadlocks other than those involving
rcu_barrier() is rather incomplete.

This commit therefore continues the section by describing how RCU's
design handles CPU hotplug in a deadlock-free way.

Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 Documentation/RCU/Design/Requirements/Requirements.rst | 49 +++--
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 1ae79a1..8807985 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1929,16 +1929,45 @@ The Linux-kernel CPU-hotplug implementation has 
notifiers that are used
 to allow the various kernel subsystems (including RCU) to respond
 appropriately to a given CPU-hotplug operation. Most RCU operations may
 be invoked from CPU-hotplug notifiers, including even synchronous
-grace-period operations such as ``synchronize_rcu()`` and
-``synchronize_rcu_expedited()``.
-
-However, all-callback-wait operations such as ``rcu_barrier()`` are also
-not supported, due to the fact that there are phases of CPU-hotplug
-operations where the outgoing CPU's callbacks will not be invoked until
-after the CPU-hotplug operation ends, which could also result in
-deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
-during its execution, which results in another type of deadlock when
-invoked from a CPU-hotplug notifier.
+grace-period operations such as (``synchronize_rcu()`` and
+``synchronize_rcu_expedited()``).  However, these synchronous operations
+do block and therefore cannot be invoked from notifiers that execute via
+``stop_machine()``, specifically those between the ``CPUHP_AP_OFFLINE``
+and ``CPUHP_AP_ONLINE`` states.
+
+In addition, all-callback-wait operations such as ``rcu_barrier()`` may
+not be invoked from any CPU-hotplug notifier.  This restriction is due
+to the fact that there are phases of CPU-hotplug operations where the
+outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
+operation ends, which could also result in deadlock. Furthermore,
+``rcu_barrier()`` blocks CPU-hotplug operations during its execution,
+which results in another type of deadlock when invoked from a CPU-hotplug
+notifier.
+
+Finally, RCU must avoid deadlocks due to interaction between hotplug,
+timers and grace period processing. It does so by maintaining its own set
+of books that duplicate the centrally maintained ``cpu_online_mask``,
+and also by reporting quiescent states explicitly when a CPU goes
+offline.  This explicit reporting of quiescent states avoids any need
+for the force-quiescent-state loop (FQS) to report quiescent states for
+offline CPUs.  However, as a debugging measure, the FQS loop does splat
+if offline CPUs block an RCU grace period for too long.
+
+An offline CPU's quiescent state will be reported either:
+1.  As the CPU goes offline using RCU's hotplug notifier (``rcu_report_dead()``).
+2.  When grace period initialization (``rcu_gp_init()``) detects a
+race either with CPU offlining or with a task unblocking on a leaf
+``rcu_node`` structure whose CPUs are all offline.
+
+The CPU-online path (``rcu_cpu_starting()``) should never need to report
+a quiescent state for an offline CPU.  However, as a debugging measure,
+it does emit a warning if a quiescent state was not already reported
+for that CPU.
+
+During the checking/modification of RCU's hotplug bookkeeping, the
+corresponding CPU's leaf node lock is held. This avoids race conditions
+between RCU's hotplug notifier hooks, the grace period initialization
+code, and the FQS loop, all of which refer to or modify this bookkeeping.
 
 Scheduler and RCU
 ~


Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

2020-12-09 Thread Joel Fernandes
Hi Marc, nice to hear from you.

On Wed, Dec 9, 2020 at 4:43 AM Marc Zyngier  wrote:
>
> Hi all,
>
> On 2020-12-08 20:02, Joel Fernandes wrote:
> > On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
> >  wrote:
> >>
> >> My apologies for the slow reply.
> >>
> >> On (20/08/17 13:25), Marc Zyngier wrote:
> >> >
> >> > It really isn't the same thing at all. You are exposing PV spinlocks,
> >> > while Sergey exposes preemption to vcpus.
> >> >
> >>
> >> Correct, we see vcpu preemption as a "fundamental" feature, with
> >> consequences that affect scheduling, which is a core feature :)
> >>
> >> Marc, is there anything in particular that you dislike about this RFC
> >> patch set? Joel has some ideas, which we may discuss offline if that
> >> works for you.
> >
> > Hi Marc, Sergey, Just checking what is the latest on this series?
>
> I was planning to give it a go, but obviously got sidetracked. :-(

Ah, that happens.

> > About the idea me and Sergey discussed, at a high level we discussed
> > being able to share information similar to "Is the vCPU preempted?"
> > using a more arch-independent infrastructure. I do not believe this
> > needs to be arch-specific. Maybe the specific mechanism about how to
> > share a page of information needs to be arch-specific, but the actual
> > information shared need not be.
>
> We already have some information sharing in the form of steal time
> accounting, and I believe this "vcpu preempted" falls in the same
> bucket. It looks like we could implement the feature as an extension
> of the steal-time accounting, as the two concepts are linked
> (one describes the accumulation of non-running time, the other is
> instantaneous).

Yeah I noticed the steal stuff. Will go look more into that.

> > This could open the door to sharing
> > more such information in an arch-independent way (for example, if the
> > scheduler needs to know other information such as the capacity of the
> > CPU that the vCPU is on).
>
> Quentin and I have discussed potential ways of improving guest
> scheduling
> on terminally broken systems (otherwise known as big-little), in the
> form of a capacity request from the guest to the host. I'm not really
> keen on the host exposing its own capacity, as that doesn't tell the
> host what the guest actually needs.

I am not sure how a capacity request could work well. It seems the
cost of a repeated hypercall could be prohibitive. In this case, a
lighter approach might be for KVM to restrict vCPU threads to run on
certain types of cores, and pass the capacity information to the guest
at guest's boot time. This would be a one-time cost to pay. And then
the guest scheduler can handle the scheduling appropriately
without any more hypercalls. Thoughts?
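
To make that a bit more concrete, I was imagining something roughly like the
following shared per-vCPU record, in the spirit of the existing steal-time
accounting (purely illustrative -- the field names and layout are my own
assumptions, not an existing ABI):

	/* Hypothetical region shared between host and guest, one per vCPU. */
	struct vcpu_sched_info {
		__u32 preempted;	/* non-zero while the vCPU is scheduled out */
		__u32 cpu_capacity;	/* capacity of the pCPU, filled in at vCPU setup */
		__u64 steal_time_ns;	/* accumulated non-running time */
	};

The host updates the fields when it schedules the vCPU in/out, and the guest
scheduler reads them locklessly, so no extra hypercalls are needed after
boot.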

- Joel


Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

2020-12-08 Thread Joel Fernandes
On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
 wrote:
>
> My apologies for the slow reply.
>
> On (20/08/17 13:25), Marc Zyngier wrote:
> >
> > It really isn't the same thing at all. You are exposing PV spinlocks,
> > while Sergey exposes preemption to vcpus.
> >
>
> Correct, we see vcpu preemption as a "fundamental" feature, with
> consequences that affect scheduling, which is a core feature :)
>
> Marc, is there anything in particular that you dislike about this RFC
> patch set? Joel has some ideas, which we may discuss offline if that
> works for you.

Hi Marc, Sergey, Just checking what is the latest on this series?

About the idea me and Sergey discussed, at a high level we discussed
being able to share information similar to "Is the vCPU preempted?"
using a more arch-independent infrastructure. I do not believe this
needs to be arch-specific. Maybe the specific mechanism about how to
share a page of information needs to be arch-specific, but the actual
information shared need not be. This could open the door to sharing
more such information in an arch-independent way (for example, if the
scheduler needs to know other information such as the capacity of the
CPU that the vCPU is on).

Other thoughts?

thanks,

 - Joel


Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-06 Thread Joel Fernandes
On Tue, Dec 01, 2020 at 08:20:50PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2020 at 02:11:33PM -0500, Joel Fernandes wrote:
> > On Wed, Nov 25, 2020 at 12:15:41PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > > 
> > > > +/*
> > > > + * Ensure that the task has been requeued. The stopper ensures that 
> > > > the task cannot
> > > > + * be migrated to a different CPU while its core scheduler queue state 
> > > > is being updated.
> > > > + * It also makes sure to requeue a task if it was running actively on 
> > > > another CPU.
> > > > + */
> > > > +static int sched_core_task_join_stopper(void *data)
> > > > +{
> > > > +   struct sched_core_task_write_tag *tag = (struct 
> > > > sched_core_task_write_tag *)data;
> > > > +   int i;
> > > > +
> > > > +   for (i = 0; i < 2; i++)
> > > > +   sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], 
> > > > false /* !group */);
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +static int sched_core_share_tasks(struct task_struct *t1, struct 
> > > > task_struct *t2)
> > > > +{
> > > 
> > > > +   stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> > > 
> > > > +}
> > > 
> > > This is *REALLY* terrible...
> > 
> > I pulled this bit from your original patch. Are you concerned about the
> > stop_machine? Sharing a core is a slow path for our usecases (and as far as 
> > I
> > know, for everyone else's). We can probably do something different if that
> > requirement changes.
> > 
> 
> Yeah.. so I can (and was planning on) remove stop_machine() from
> sched_core_{dis,en}able() before merging it.
> 
> (there's two options, one uses stop_cpus() with the SMT mask, the other
> RCU)

Ok. What about changing the cookie of task T while holding the rq/pi locks, and
then doing a resched_curr(rq) for that RQ?

Holding the locks ensures no migration of the task happens, and resched_curr()
ensures that task T's rq will enter the scheduler to consider task T's new
cookie for scheduling. I believe this is analogous to what
__sched_setscheduler() does when you switch a task from CFS to RT.
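
In other words, a minimal sketch of the idea (not a tested patch; it assumes
the usual task_rq_lock()/resched_curr() helpers and elides the requeue into
the core rbtree):

	static void sched_core_update_cookie_locked(struct task_struct *p,
						    unsigned long cookie)
	{
		struct rq_flags rf;
		struct rq *rq;

		/* Holds both rq->lock and p->pi_lock, so p cannot migrate. */
		rq = task_rq_lock(p, &rf);
		p->core_cookie = cookie;
		/* Force the CPU back into schedule() to honor the new cookie. */
		resched_curr(rq);
		task_rq_unlock(rq, p, &rf);
	}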

> This though is exposing stop_machine() to joe user. Everybody is allowed
> to prctl() it's own task and set a cookie on himself. This means you
> just made giant unpriv DoS vector.
> 
> stop_machine is bad, really bad.

Agreed.

thanks,

 - Joel



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-06 Thread Joel Fernandes
Hi Peter,

On Tue, Dec 01, 2020 at 08:34:51PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2020 at 02:20:28PM -0500, Joel Fernandes wrote:
> > On Wed, Nov 25, 2020 at 12:10:14PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > > > +void sched_core_tag_requeue(struct task_struct *p, unsigned long 
> > > > cookie, bool group)
> > > > +{
> > > > +   if (!p)
> > > > +   return;
> > > > +
> > > > +   if (group)
> > > > +   p->core_group_cookie = cookie;
> > > > +   else
> > > > +   p->core_task_cookie = cookie;
> > > > +
> > > > +   /* Use up half of the cookie's bits for task cookie and 
> > > > remaining for group cookie. */
> > > > +   p->core_cookie = (p->core_task_cookie <<
> > > > +   (sizeof(unsigned long) * 4)) + 
> > > > p->core_group_cookie;
> > > 
> > > This seems dangerous; afaict there is nothing that prevents cookie
> > > collision.
> > 
> > This is fixed in a later patch by Josh "sched: Refactor core cookie into
> > struct" where we are having independent fields for each type of cookie.
> 
> So I don't think that later patch is right... That is, it works, but
> afaict it's massive overkill.
> 
>   COOKIE_CMP_RETURN(task_cookie);
>   COOKIE_CMP_RETURN(group_cookie);
>   COOKIE_CMP_RETURN(color);
> 
> So if task_cookie matches, we consider group_cookie, if that matches we
> consider color.

Yes, the cookie is a compound one. A cookie structure is created here which
has all 3 components. The final cookie value is a pointer to this compound
structure.

> 
> Now, afaict that's semantically exactly the same as just using the
> narrowest cookie. That is, use the task cookie if there is, and then,
> walking the cgroup hierarchy (up) pick the first cgroup cookie.
[..]
> Which means you only need a single active cookie field.

It's not the same. The "compounded" cookie is needed to enforce the CGroup
delegation model. This is exactly how both uclamp and cpusets work. For
uclamp, if we take uclamp.min as an example, if the CGroup sets the
uclamp.min of a group to 0, then even if you use the per-task interface
(sched_setattr) for setting the task's clamp - the "effective uclamp.min" of
the task as seen in /proc/pid/sched will still be 0.

Similar thing here, if 2 tasks belong to 2 different CGroups and each group is
tagged with its own tag, then if you use the per-task interface to make the 2
tasks share a core, the "effective" sharing is still such that the tasks will
not share a core -- because the CGroup decided to make it so.

I would really like to maintain this model. Doing it any other way is
confusing - we have already discussed doing it this way before. With the other
approach you end up failing one interface if another one was already used.  Been
there, done that. It sucks a lot.

> IOW, you're just making things complicated and expensive.

The cost of the additional comparisons you mentioned is only in the slow path
(i.e. when someone joins or leaves a group). Once the task_struct's cookie
field is set, the cost is not any more than what it was before.
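
For reference, the compound comparison from Josh's refactor boils down to
something like this (sketch only, simplified from the quoted patch):

	/* Order cookies by task cookie, then group cookie, then color. */
	#define COOKIE_CMP_RETURN(field)			\
		do {						\
			if (a->field < b->field)		\
				return -1;			\
			else if (a->field > b->field)		\
				return 1;			\
		} while (0)

	static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
					 const struct sched_core_cookie *b)
	{
		COOKIE_CMP_RETURN(task_cookie);
		COOKIE_CMP_RETURN(group_cookie);
		COOKIE_CMP_RETURN(color);
		return 0;
	}

Two tasks end up sharing a core only if all three components match, so a
per-task request can never widen sharing beyond what the CGroup tag allows --
which is exactly the delegation property described above.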

Any other thoughts?

thanks,

 - Joel



Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-06 Thread Joel Fernandes
On Wed, Dec 02, 2020 at 04:47:17PM -0500, Chris Hyser wrote:
> On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to share a
> > core with another thread, or have a core exclusively for itself.
> > 
> > ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
> > down the keypress latency in Google docs from 150ms to 50ms while improving
> > the camera streaming frame rate by ~3%.
> > 
> 
> Inline is a patch for comment to extend this interface to make it more useful.
> This patch would still need to provide doc and selftests updates as well.
> 
> -chrish
> 
> ---8<---
[..]  
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 217b048..f8e4e96 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -250,5 +250,8 @@ struct prctl_mm_map {
>  
>  /* Request the scheduler to share a core */
>  #define PR_SCHED_CORE_SHARE  59
> +# define PR_SCHED_CORE_CLEAR 0  /* clear core_sched cookie of pid */
> +# define PR_SCHED_CORE_SHARE_FROM1  /* get core_sched cookie from pid */
> +# define PR_SCHED_CORE_SHARE_TO  2  /* push core_sched cookie to pid */
>  
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
> index 800c0f8..14feac1 100644
> --- a/kernel/sched/coretag.c
> +++ b/kernel/sched/coretag.c
> @@ -9,6 +9,7 @@
>   */
>  
>  #include "sched.h"
> +#include "linux/prctl.h"
>  
>  /*
>   * Wrapper representing a complete cookie. The address of the cookie is used 
> as
> @@ -456,40 +457,45 @@ int sched_core_share_tasks(struct task_struct *t1, 
> struct task_struct *t2)
>  }
>  
>  /* Called from prctl interface: PR_SCHED_CORE_SHARE */
> -int sched_core_share_pid(pid_t pid)
> +int sched_core_share_pid(unsigned long flags, pid_t pid)
>  {
> + struct task_struct *dest;
> + struct task_struct *src;
>   struct task_struct *task;
>   int err;
>  
> - if (pid == 0) { /* Recent current task's cookie. */
> - /* Resetting a cookie requires privileges. */
> - if (current->core_task_cookie)
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> - task = NULL;
> - } else {
> - rcu_read_lock();
> - task = pid ? find_task_by_vpid(pid) : current;
> - if (!task) {
> - rcu_read_unlock();
> - return -ESRCH;
> - }
> -
> - get_task_struct(task);
> -
> - /*
> -  * Check if this process has the right to modify the specified
> -  * process. Use the regular "ptrace_may_access()" checks.
> -  */
> - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> - rcu_read_unlock();
> - err = -EPERM;
> - goto out;
> - }
> + rcu_read_lock();
> + task = find_task_by_vpid(pid);
> + if (!task) {
>   rcu_read_unlock();
> + return -ESRCH;
>   }
>  
> - err = sched_core_share_tasks(current, task);
> + get_task_struct(task);
> +
> + /*
> +  * Check if this process has the right to modify the specified
> +  * process. Use the regular "ptrace_may_access()" checks.
> +  */
> + if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> + rcu_read_unlock();
> + err = -EPERM;
> + goto out;
> + }
> + rcu_read_unlock();
> +
> + if (flags == PR_SCHED_CORE_CLEAR) {
> + dest = task;
> + src = NULL;
> + } else if (flags == PR_SCHED_CORE_SHARE_TO) {
> + dest = task;
> + src = current;
> + } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
> + dest = current;
> + src = task;
> + }

Looks ok to me except the missing else { } clause you found. Also, maybe
dest/src can be renamed to from/to to make the meaning of the variables clearer?
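
(For the record, the missing else clause I meant would be something along the
lines of:

	} else {
		err = -EINVAL;
		goto out;
	}

so that an unrecognized flags value errors out instead of leaving dest/src
uninitialized. And if I am reading the patch right, userspace would then do
e.g. prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, pid) from the trusted
parent.)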

Also looking forward to the docs/test updates.

thanks!

 - Joel


> +
> + err = sched_core_share_tasks(dest, src);
>  out:
>   if (task)
>   put_task_struct(task);
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index cffdfab..50c31f3 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct 
> pid_names

Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-12-02 Thread Joel Fernandes
On Tue, Dec 01, 2020 at 08:21:43PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 01, 2020 at 05:26:32PM -0500, Joel Fernandes wrote:
> > On Thu, Nov 19, 2020 at 3:42 PM Joel Fernandes  
> > wrote:
> > >
> > > On Thu, Nov 19, 2020 at 12:16:15PM -0800, Paul E. McKenney wrote:
> > > > On Thu, Nov 19, 2020 at 02:44:35PM -0500, Joel Fernandes wrote:
> > > > > On Thu, Nov 19, 2020 at 2:22 PM Paul E. McKenney  
> > > > > wrote:
> > > > > > > > > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes 
> > > > > > > > > > (Google) wrote:
> > > > > > > > > > > After rcu_do_batch(), add a check for whether the seglen 
> > > > > > > > > > > counts went to
> > > > > > > > > > > zero if the list was indeed empty.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > > > > > > 
> > > > > > > > > >
> > > > > > > > > > Queued for testing and further review, thank you!
> > > > > > > > >
> > > > > > > > > FYI, the second of the two checks triggered in all four 
> > > > > > > > > one-hour runs of
> > > > > > > > > TREE01, all four one-hour runs of TREE04, and one of the four 
> > > > > > > > > one-hour
> > > > > > > > > runs of TREE07.  This one:
> > > > > > > > >
> > > > > > > > > WARN_ON_ONCE(count != 0 && 
> > > > > > > > > rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > > > > > > > >
> > > > > > > > > That is, there are callbacks in the list, but the sum of the 
> > > > > > > > > segment
> > > > > > > > > counts is nevertheless zero.  The ->nocb_lock is held.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > FWIW, TREE01 reproduces it very quickly compared to the other 
> > > > > > > > two
> > > > > > > > scenarios, on all four run, within five minutes.
> > > > > > >
> > > > > > > So far for TREE01, I traced it down to an rcu_barrier happening 
> > > > > > > so it could
> > > > > > > be related to some interaction with rcu_barrier() (Just a guess).
> > > > > >
> > > > > > Well, rcu_barrier() and srcu_barrier() are the only users of
> > > > > > rcu_segcblist_entrain(), if that helps.  Your modification to that
> > > > > > function looks plausible to me, but the system's opinion always 
> > > > > > overrules
> > > > > > mine.  ;-)
> > > > >
> > > > > Right. Does anything in the bypass code stand out? That happens during
> > > > > rcu_barrier() as well, and it messes with the lengths.
> > > >
> > > > In theory, rcu_barrier_func() flushes the bypass before doing the
> > > > entrain, and does the rcu_segcblist_entrain() afterwards.
> > > >
> > > > Ah, and that is the issue.  If ->cblist is empty and ->nocb_bypass
> > > > is not, then ->cblist length will be nonzero, and none of the
> > > > segments will be nonzero.
> > > >
> > > > So you need something like this for that second WARN, correct?
> > > >
> > > >   WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
> > > >    rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > 
> > Just started to look into it again. If the ->cblist is empty, that
> > means the bypass list could not have been used (since, per the comments on
> > rcu_nocb_try_bypass(), the bypass list is in use only when the cblist
> > is non-empty). So if the cblist was non-empty, the segment counts
> > should not sum to 0.  So I don't think that explains it. Anyway, I
> > will try the new version of your warning in case there is something
> > about bypass lists that I'm missing.
> 
> Good point.  I really did see failures, though.  Do they show up for
> you?

Yeah I do see failures. Once I change the warning as below, the failures go
away though. So it looks like a segcblist can indeed be empty when the bypass
list has something in it?  If you agree, could you change the warning to the
one below? The tests that were failing before now all pass 1-hour rcutorture
testing (TREE01, TREE04 and TREE07).

---8<---

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 91e35b521e51..3cd92b7df8ac 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2554,7 +2554,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
	WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
		     count != 0 && rcu_segcblist_empty(&rdp->cblist));
	WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
-	WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
+	WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
+		     rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 
-- 
2.29.2.454.gaff20da3a2-goog



Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-12-01 Thread Joel Fernandes
On Thu, Nov 19, 2020 at 3:42 PM Joel Fernandes  wrote:
>
> On Thu, Nov 19, 2020 at 12:16:15PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 19, 2020 at 02:44:35PM -0500, Joel Fernandes wrote:
> > > On Thu, Nov 19, 2020 at 2:22 PM Paul E. McKenney  
> > > wrote:
> > > > > > > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes 
> > > > > > > > (Google) wrote:
> > > > > > > > > After rcu_do_batch(), add a check for whether the seglen 
> > > > > > > > > counts went to
> > > > > > > > > zero if the list was indeed empty.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > > > > 
> > > > > > > >
> > > > > > > > Queued for testing and further review, thank you!
> > > > > > >
> > > > > > > FYI, the second of the two checks triggered in all four one-hour 
> > > > > > > runs of
> > > > > > > TREE01, all four one-hour runs of TREE04, and one of the four 
> > > > > > > one-hour
> > > > > > > runs of TREE07.  This one:
> > > > > > >
> > > > > > > WARN_ON_ONCE(count != 0 && 
> > > > > > > rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > > > > > >
> > > > > > > That is, there are callbacks in the list, but the sum of the 
> > > > > > > segment
> > > > > > > counts is nevertheless zero.  The ->nocb_lock is held.
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > FWIW, TREE01 reproduces it very quickly compared to the other two
> > > > > > scenarios, on all four run, within five minutes.
> > > > >
> > > > > So far for TREE01, I traced it down to an rcu_barrier happening so it 
> > > > > could
> > > > > be related to some interaction with rcu_barrier() (Just a guess).
> > > >
> > > > Well, rcu_barrier() and srcu_barrier() are the only users of
> > > > rcu_segcblist_entrain(), if that helps.  Your modification to that
> > > > function looks plausible to me, but the system's opinion always 
> > > > overrules
> > > > mine.  ;-)
> > >
> > > Right. Does anything in the bypass code stand out? That happens during
> > > rcu_barrier() as well, and it messes with the lengths.
> >
> > In theory, rcu_barrier_func() flushes the bypass before doing the
> > entrain, and does the rcu_segcblist_entrain() afterwards.
> >
> > Ah, and that is the issue.  If ->cblist is empty and ->nocb_bypass
> > is not, then ->cblist length will be nonzero, and none of the
> > segments will be nonzero.
> >
> > So you need something like this for that second WARN, correct?
> >
> >   WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
> >    rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);

Just started to look into it again. If the ->cblist is empty, that
means the bypass list could not have been used (since, per the comments on
rcu_nocb_try_bypass(), the bypass list is in use only when the cblist
is non-empty). So if the cblist was non-empty, the segment counts
should not sum to 0.  So I don't think that explains it. Anyway, I
will try the new version of your warning in case there is something
about bypass lists that I'm missing.


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-01 Thread Joel Fernandes
Hi Peter,

On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don 
> > 
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added 
> > which
> > is writeable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' possible to set only by root 
> > while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses is different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >/ \
> >   A   B(These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason why Google has two parts is that AppEngine wants to allow a 
> > subset of
> > subcgroups within a parent tagged cgroup sharing execution. Think of these
> > subcgroups belong to the same customer or project. Because these subcgroups 
> > are
> > created by AppEngine, they are not tracked by borglet (the root daemon),
> > therefore borglet won't have a chance to set a color for them. That's where
> > 'color' file comes from. Color could be set by AppEngine, and once set, the
> > normal tasks within the subcgroup would not be able to overwrite it. This is
> > enforced by promoting the permission of the color file in cgroupfs.
> 
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

Google's usecase has two parts. The first part is set by a
privileged process, and the second part (color) is set within the container.
Maybe we can just put the "color" feature behind a CONFIG option for Google
to enable?

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Ok, the idea was to use stop_machine() as in your initial patch. It works
quite well in testing. However, I agree it's horrible; we ought to do
better (or at least try).

Maybe we can do a synchronize_rcu() after changing the cookie, to ensure we are
no longer using the old cookie value in the scheduler.
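
Roughly what I have in mind (very much a sketch -- locking elided, and the
cookie release helper below is made up for illustration):

	/* Caller serializes cookie updates, e.g. via sched_core_tasks_mutex. */
	static void sched_core_change_cookie(struct task_struct *p,
					     unsigned long new_cookie)
	{
		unsigned long old = p->core_cookie;

		p->core_cookie = new_cookie;
		resched_curr(task_rq(p));

		/* Wait out any core-wide pick still looking at 'old'. */
		synchronize_rcu();
		sched_core_put_cookie(old);	/* hypothetical release helper */
	}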

> At best you now have the requirements sorted.

Yes.

thanks,

 - Joel



Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 02:08:08PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> > +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> > +int sched_core_share_pid(pid_t pid)
> > +{
> > +   struct task_struct *task;
> > +   int err;
> > +
> > +   if (pid == 0) { /* Recent current task's cookie. */
> > +   /* Resetting a cookie requires privileges. */
> > +   if (current->core_task_cookie)
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EPERM;
> 
> Coding-Style fail.
> 
> Also, why?!? I realize it is true for your case, because hardware fail.
> But in general this just isn't true. This wants to be some configurable
> policy.

True. I think you and I discussed eons ago though that it needs to be
privileged.  For our case, we actually use seccomp, so we don't let
untrusted tasks set a cookie anyway, let alone reset it. We do it before we
enter the seccomp sandbox. So we don't really need this security check here.

Since you dislike this part of the patch, I am Ok with just dropping it as
below:

---8<---

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 8fce3f4b7cae..9b587a1245f5 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -443,11 +443,7 @@ int sched_core_share_pid(pid_t pid)
struct task_struct *task;
int err;
 
-   if (pid == 0) { /* Recent current task's cookie. */
-   /* Resetting a cookie requires privileges. */
-   if (current->core_task_cookie)
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
+   if (!pid) { /* Reset current task's cookie. */
task = NULL;
} else {
rcu_read_lock();


Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 12:10:14PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, 
> > bool group)
> > +{
> > +   if (!p)
> > +   return;
> > +
> > +   if (group)
> > +   p->core_group_cookie = cookie;
> > +   else
> > +   p->core_task_cookie = cookie;
> > +
> > +   /* Use up half of the cookie's bits for task cookie and remaining for 
> > group cookie. */
> > +   p->core_cookie = (p->core_task_cookie <<
> > +   (sizeof(unsigned long) * 4)) + 
> > p->core_group_cookie;
> 
> This seems dangerous; afaict there is nothing that prevents cookie
> collision.

This is fixed in a later patch by Josh "sched: Refactor core cookie into
struct" where we are having independent fields for each type of cookie.

I'll squash it next time I post to prevent confusion. Thanks,

 - Joel



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 12:11:28PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> 
> > + * sched_core_tag_requeue - Common helper for all interfaces to set a 
> > cookie.
> 
> sched_core_set_cookie() would be a saner name, given that description,
> don't you think?

Yeah, Josh is better than me at naming, so he changed it to
sched_core_update_cookie() already :-). Hopefully that's OK with you too.

thanks,

 - Joel



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 12:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> 
> > +/*
> > + * Ensure that the task has been requeued. The stopper ensures that the 
> > task cannot
> > + * be migrated to a different CPU while its core scheduler queue state is 
> > being updated.
> > + * It also makes sure to requeue a task if it was running actively on 
> > another CPU.
> > + */
> > +static int sched_core_task_join_stopper(void *data)
> > +{
> > +   struct sched_core_task_write_tag *tag = (struct 
> > sched_core_task_write_tag *)data;
> > +   int i;
> > +
> > +   for (i = 0; i < 2; i++)
> > +   sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* 
> > !group */);
> > +
> > +   return 0;
> > +}
> > +
> > +static int sched_core_share_tasks(struct task_struct *t1, struct 
> > task_struct *t2)
> > +{
> 
> > +   stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> 
> > +}
> 
> This is *REALLY* terrible...

I pulled this bit from your original patch. Are you concerned about the
stop_machine? Sharing a core is a slow path for our usecases (and as far as I
know, for everyone else's). We can probably do something different if that
requirement changes.



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 12:07:09PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > Also, for the per-task cookie, it will get weird if we use pointers of any
> > emphemeral objects. For this reason, introduce a refcounted object who's 
> > sole
> > purpose is to assign unique cookie value by way of the object's pointer.
> 
> Might be useful to explain why exactly none of the many pid_t's are
> good enough.

I thought about this already and it does not seem a good fit. When 2
processes share, it is possible that more processes are added to that logical
group. The original processes that shared may then die, and if we held on to
their pid_t or task_struct, that would be awkward. It seemed introducing a new
refcounted struct was the right way to go. I can add these details to the
change log.
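
Concretely, the refcounted object ends up being roughly this later in the
series:

	/* The cookie value is simply the address of this object. */
	struct sched_core_task_cookie {
		refcount_t refcnt;
		struct work_struct work;	/* deferred free on last put */
	};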

thanks!

 - Joel



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 02:03:22PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +static bool sched_core_get_task_cookie(unsigned long cookie)
> > +{
> > +   struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +   /*
> > +* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +* is done after the stopper runs.
> > +*/
> > +   sched_core_get();
> > +   return refcount_inc_not_zero(&ptr->refcnt);
> 
> See below, but afaict this should be refcount_inc().

Fully agreed with all these. Updated with diff as below. Will test further
and post next version soon. Thanks!

---8<---

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 2fb5544a4a18..8fce3f4b7cae 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -288,12 +288,12 @@ static unsigned long sched_core_alloc_task_cookie(void)
return (unsigned long)ck;
 }
 
-static bool sched_core_get_task_cookie(unsigned long cookie)
+static void sched_core_get_task_cookie(unsigned long cookie)
 {
struct sched_core_task_cookie *ptr =
(struct sched_core_task_cookie *)cookie;
 
-   return refcount_inc_not_zero(&ptr->refcnt);
+   refcount_inc(&ptr->refcnt);
 }
 
 static void sched_core_put_task_cookie(unsigned long cookie)
@@ -392,10 +392,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
sched_core_get(); /* For the alloc. */
 
/* Add another reference for the other task. */
-   if (!sched_core_get_task_cookie(cookie)) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   sched_core_get_task_cookie(cookie);
sched_core_get(); /* For the other task. */
 
wr.tasks[0] = t1;
@@ -411,10 +408,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
 
} else if (!t1->core_task_cookie && t2->core_task_cookie) {
/* CASE 3. */
-   if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   sched_core_get_task_cookie(t2->core_task_cookie);
sched_core_get();
 
wr.tasks[0] = t1;
@@ -422,10 +416,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
 
} else {
/* CASE 4. */
-   if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   sched_core_get_task_cookie(t2->core_task_cookie);
sched_core_get();
 
sched_core_put_task_cookie(t1->core_task_cookie);
-- 
2.29.2.454.gaff20da3a2-goog



Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 01:54:47PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +/* Per-task interface */
> > +static unsigned long sched_core_alloc_task_cookie(void)
> > +{
> > +   struct sched_core_cookie *ptr =
> > +   kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> > +
> > +   if (!ptr)
> > +   return 0;
> > +   refcount_set(&ptr->refcnt, 1);
> > +
> > +   /*
> > +* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +* is done after the stopper runs.
> > +*/
> > +   sched_core_get();
> > +   return (unsigned long)ptr;
> > +}
> > +
> > +static bool sched_core_get_task_cookie(unsigned long cookie)
> > +{
> > +   struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +   /*
> > +* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +* is done after the stopper runs.
> > +*/
> > +   sched_core_get();
> > +   return refcount_inc_not_zero(&ptr->refcnt);
> > +}
> > +
> > +static void sched_core_put_task_cookie(unsigned long cookie)
> > +{
> > +   struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +   if (refcount_dec_and_test(&ptr->refcnt))
> > +   kfree(ptr);
> > +}
> 
> > +   /*
> > +* NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> > +*   sched_core_put_task_cookie(). However, sched_core_put() is done
> > +*   by this function *after* the stopper removes the tasks from the
> > +*   core queue, and not before. This is just to play it safe.
> > +*/
> 
> So for no reason what so ever you've made the code more difficult?

You're right, I could just do sched_core_get() in the caller. I changed it as in
the diff below:

---8<---

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 800c0f8bacfc..75e2edb53a48 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -274,6 +274,7 @@ void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
 /* Per-task interface: Used by fork(2) and prctl(2). */
 static void sched_core_put_cookie_work(struct work_struct *ws);
 
+/* Caller has to call sched_core_get() if non-zero value is returned. */
 static unsigned long sched_core_alloc_task_cookie(void)
 {
struct sched_core_task_cookie *ck =
@@ -284,11 +285,6 @@ static unsigned long sched_core_alloc_task_cookie(void)
	refcount_set(&ck->refcnt, 1);
	INIT_WORK(&ck->work, sched_core_put_cookie_work);
 
-   /*
-* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-* is done after the stopper runs.
-*/
-   sched_core_get();
return (unsigned long)ck;
 }
 
@@ -354,12 +350,6 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
 
	mutex_lock(&sched_core_tasks_mutex);
 
-   /*
-* NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
-*   sched_core_put_task_cookie(). However, sched_core_put() is done
-*   by this function *after* the stopper removes the tasks from the
-*   core queue, and not before. This is just to play it safe.
-*/
if (!t2) {
if (t1->core_task_cookie) {
sched_core_put_task_cookie(t1->core_task_cookie);
@@ -370,7 +360,9 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
/* Assign a unique per-task cookie solely for t1. */
 
cookie = sched_core_alloc_task_cookie();
-   if (!cookie)
+   if (cookie)
+   sched_core_get();
+   else
goto out_unlock;
 
if (t1->core_task_cookie) {
@@ -401,7 +393,9 @@ int sched_core_share_tasks(struct task_struct *t1, struct 
task_struct *t2)
 
/* CASE 1. */
cookie = sched_core_alloc_task_cookie();
-   if (!cookie)
+   if (cookie)
+   sched_core_get();
+   else
goto out_unlock;
 
/* Add another reference for the other task. */


Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 09:49:08AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 24, 2020 at 01:03:43PM -0500, Joel Fernandes wrote:
> > On Tue, Nov 24, 2020 at 05:13:35PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:
> 
> > > > +static inline void generic_idle_enter(void)
> > > > +static inline void generic_idle_exit(void)
> 
> > > That naming is terrible..
> > 
> > Yeah sorry :-\. The naming I chose was to be aligned with the
> > CONFIG_GENERIC_ENTRY naming. I am open to ideas on that.
> 
> entry_idle_{enter,exit}() ?

Sounds good to me.

> > > I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
> > > for you?
> > 
> > The way this patch series works, it does not depend on arch code as much as
> > possible. Since there are other arch that may need this patchset such as 
> > ARM,
> > it may be better to keep it in the generic entry code.  Thoughts?
> 
> I didn't necessarily mean using those hooks, even placing your new hooks
> right next to them would've covered the exact same code with less lines
> modified.

Ok sure. I will improve it this way for next posting.

thanks,

 - Joel



Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode

2020-12-01 Thread Joel Fernandes
On Wed, Nov 25, 2020 at 10:37:00AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> 
> >  .../admin-guide/kernel-parameters.txt |  11 +
> >  include/linux/entry-common.h  |  12 +-
> >  include/linux/sched.h |  12 +
> >  kernel/entry/common.c |  28 +-
> >  kernel/sched/core.c   | 241 ++
> >  kernel/sched/sched.h  |   3 +
> >  6 files changed, 304 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index bd1a5b87a5e2..b185c6ed4aba 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,17 @@
> >  
> > sbni=   [NET] Granch SBNI12 leased line adapter
> >  
> > +   sched_core_protect_kernel=
> > +   [SCHED_CORE] Pause SMT siblings of a core running in
> > +   user mode, if at least one of the siblings of the core
> > +   is running in kernel mode. This is to guarantee that
> > +   kernel data is not leaked to tasks which are not trusted
> > +   by the kernel. A value of 0 disables protection, 1
> > +   enables protection. The default is 1. Note that 
> > protection
> > +   depends on the arch defining the _TIF_UNSAFE_RET flag.
> > +   Further, for protecting VMEXIT, arch needs to call
> > +   KVM entry/exit hooks.
> > +
> > sched_debug [KNL] Enables verbose scheduler debug messages.
> >  
> > schedstats= [KNL,X86] Enable or disable scheduled statistics.
> 
> So I don't like the parameter name, it's too long. Also I don't like it
> because its a boolean.

Maybe ht_protect= then?

> You're adding syscall,irq,kvm under a single knob where they're all due
> to different flavours of broken. Different hardware might want/need
> different combinations.

Ok, I can try to make it ht_protect=irq,syscall,kvm etc., and conditionally
enable the protection. Does that work for you?
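
i.e., something along these lines (sketch only, the HT_PROTECT_* flag names
are made up):

	static unsigned int ht_protect_mask;	/* which entry points get protection */

	static int __init ht_protect_setup(char *str)
	{
		char *tok;

		while ((tok = strsep(&str, ",")) != NULL) {
			if (!strcmp(tok, "syscall"))
				ht_protect_mask |= HT_PROTECT_SYSCALL;
			else if (!strcmp(tok, "irq"))
				ht_protect_mask |= HT_PROTECT_IRQ;
			else if (!strcmp(tok, "kvm"))
				ht_protect_mask |= HT_PROTECT_KVM;
		}
		return 1;
	}
	__setup("ht_protect=", ht_protect_setup);

Then the syscall/irq/KVM hooks each check their own bit instead of a single
boolean, so hardware that only needs (say) the VMEXIT protection does not pay
for the rest.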
> 
> Hardware without MDS but with L1TF wouldn't need the syscall hook, but
> you're not givng a choice here. And this is generic code, you can't
> assume stuff like this.

Got it.

thanks,

 - Joel



Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling

2020-12-01 Thread Joel Fernandes
On Thu, Nov 26, 2020 at 10:05:19AM +1100, Balbir Singh wrote:
> On Tue, Nov 24, 2020 at 01:30:38PM -0500, Joel Fernandes wrote:
> > On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> > > On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > > > From: Peter Zijlstra 
> > > > 
> > > > The rationale is as follows. In the core-wide pick logic, even if
> > > > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > > > see if they could be running RT.
> > > > 
> > > > Say the RQs in a particular core look like this:
> > > > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> > > > 
> > > > rq0rq1
> > > > CFS1 (tagged)  RT1 (not tag)
> > > > CFS2 (tagged)
> > > > 
> > > > Say schedule() runs on rq0. Now, it will enter the above loop and
> > > > pick_task(RT) will return NULL for 'p'. It will enter the above if() 
> > > > block
> > > > and see that need_sync == false and will skip RT entirely.
> > > > 
> > > > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > > > rq0 rq1
> > > > CFS1IDLE
> > > > 
> > > > When it should have selected:
> > > > rq0 r1
> > > > IDLERT
> > > > 
> > > > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > > > gets constantly force-idled and breaks RT. Lets cure it.
> > > > 
> > > > NOTE: This problem will be fixed differently in a later patch. It just
> > > >   kept here for reference purposes about this issue, and to make
> > > >   applying later patches easier.
> > > >
> > > 
> > > The changelog is hard to read, it refers to above if(), whereas there
> > > is no code snippet in the changelog.
> > 
> > Yeah sorry, it comes from this email where I described the issue:
> > http://lore.kernel.org/r/20201023175724.ga3563...@google.com
> > 
> > I corrected the changelog and appended the patch below. Also pushed it to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
> > 
> > > Also, from what I can see following
> > > the series, p->core_cookie is not yet set anywhere (unless I missed it),
> > > so fixing it in here did not make sense just reading the series.
> > 
> > The interface patches for core_cookie are added later, that's how it is. The
> > infrastructure comes first here. It would also not make sense to add
> > interface first as well so I think the current ordering is fine.
> >
> 
> Some comments below to help make the code easier to understand
> 
> > ---8<---
> > 
> > From: Peter Zijlstra 
> > Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling
> > 
> > The rationale is as follows. In the core-wide pick logic, even if
> > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > see if they could be running RT.
> > 
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> > 
> > rq0rq1
> > CFS1 (tagged)  RT1 (not tag)
> > CFS2 (tagged)
> > 
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0     rq1
> > CFS1IDLE
> > 
> > When it should have selected:
> > rq0 r1
> > IDLERT
> > 
> > Fix this issue by forcing need_sync and restarting the search if a
> > cookied task was discovered. This will avoid this optimization from
> > making incorrect picks.
> > 
> > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > gets constantly force-idled and breaks RT. Lets cure it.
> > 
> > NOTE: This problem will be fixed differently in a later patch. It just
> >   kept here for reference purposes about this issue, and to make
> >   applying later patches easier.
> > 
> > Reported-by: Joel Fernandes (Google) 
> > Signed-off-by: Peter Zijlstra 
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  kernel/sched/core.c | 25 -
> >  1 file changed, 16 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ee4902c2cf5..53af817740c0 100644
&

Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling

2020-11-24 Thread Joel Fernandes
On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra 
> > 
> > The rationale is as follows. In the core-wide pick logic, even if
> > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > see if they could be running RT.
> > 
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> > 
> > rq0rq1
> > CFS1 (tagged)  RT1 (not tag)
> > CFS2 (tagged)
> > 
> > Say schedule() runs on rq0. Now, it will enter the above loop and
> > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > and see that need_sync == false and will skip RT entirely.
> > 
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0 rq1
> > CFS1IDLE
> > 
> > When it should have selected:
> > rq0 r1
> > IDLERT
> > 
> > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > gets constantly force-idled and breaks RT. Lets cure it.
> > 
> > NOTE: This problem will be fixed differently in a later patch. It just
> >   kept here for reference purposes about this issue, and to make
> >   applying later patches easier.
> >
> 
> The changelog is hard to read, it refers to above if(), whereas there
> is no code snippet in the changelog.

Yeah sorry, it comes from this email where I described the issue:
http://lore.kernel.org/r/20201023175724.ga3563...@google.com

I corrected the changelog and appended the patch below. Also pushed it to:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched

> Also, from what I can see following
> the series, p->core_cookie is not yet set anywhere (unless I missed it),
> so fixing it in here did not make sense just reading the series.

The interface patches for core_cookie are added later; that's how the series
is structured. The infrastructure comes first here. It would not make sense
to add the interface first either, so I think the current ordering is fine.

---8<---

From: Peter Zijlstra 
Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs) to
see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.

rq0rq1
CFS1 (tagged)  RT1 (not tag)
CFS2 (tagged)

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0 rq1
CFS1IDLE

When it should have selected:
rq0 r1
IDLERT

Fix this issue by forcing need_sync and restarting the search if a
cookied task was discovered. This prevents the optimization from making
incorrect picks.

Joel saw this issue in real-world usecases on ChromeOS where an RT task
gets constantly force-idled and breaks RT. Let's cure it.

NOTE: This problem will be fixed differently in a later patch. It is
  just kept here for reference about this issue, and to make
  applying later patches easier.

Reported-by: Joel Fernandes (Google) 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ee4902c2cf5..53af817740c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
need_sync = !!rq->core->core_cookie;
 
/* reset state */
+reset:
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
@@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
/*
 * If there weren't no cookies; we don't need to
 * bother with the other siblings.
-* If the rest of the core is not running a 
tagged
-* task, i.e.  need_sync == 0, and the current 
CPU
-* which called into the schedule() loop does 
not
-* have any tasks for this class, skip 
selecting for
-* other siblings since there's no point. We 
don't skip
-* for RT/DL because that could make CFS 
force-idl

Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit

2020-11-24 Thread Joel Fernandes
On Tue, Nov 24, 2020 at 05:13:35PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:
> > Add a generic_idle_{enter,exit} helper function to enter and exit kernel
> > protection when entering and exiting idle, respectively.
> > 
> > While at it, remove a stale RCU comment.
> > 
> > Reviewed-by: Alexandre Chartre 
> > Tested-by: Julien Desfossez 
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  include/linux/entry-common.h | 18 ++
> >  kernel/sched/idle.c  | 11 ++-
> >  2 files changed, 24 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 022e1f114157..8f34ae625f83 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
> > return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
> > && _TIF_UNSAFE_RET != 0;
> >  }
> > +
> > +/**
> > + * generic_idle_enter - General tasks to perform during idle entry.
> > + */
> > +static inline void generic_idle_enter(void)
> > +{
> > +   /* Entering idle ends the protected kernel region. */
> > +   sched_core_unsafe_exit();
> > +}
> > +
> > +/**
> > + * generic_idle_exit  - General tasks to perform during idle exit.
> > + */
> > +static inline void generic_idle_exit(void)
> > +{
> > +   /* Exiting idle (re)starts the protected kernel region. */
> > +   sched_core_unsafe_enter();
> > +}
> >  #endif
> 
> That naming is terrible..

Yeah sorry :-\. The naming I chose was to be aligned with the
CONFIG_GENERIC_ENTRY naming. I am open to ideas on that.

> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 8bdb214eb78f..ee4f91396c31 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -8,6 +8,7 @@
> >   */
> >  #include "sched.h"
> >  
> > +#include 
> >  #include 
> >  
> >  /* Linker adds these: start and end of __cpuidle functions */
> > @@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
> >  
> >  static noinline int __cpuidle cpu_idle_poll(void)
> >  {
> > +   generic_idle_enter();
> > trace_cpu_idle(0, smp_processor_id());
> > stop_critical_timings();
> > rcu_idle_enter();
> > @@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
> > rcu_idle_exit();
> > start_critical_timings();
> > trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> > +   generic_idle_exit();
> >  
> > return 1;
> >  }
> > @@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
> > return;
> > }
> >  
> > -   /*
> > -* The RCU framework needs to be told that we are entering an idle
> > -* section, so no more rcu read side critical sections and one more
> > -* step to the grace period
> > -*/
> > +   generic_idle_enter();
> >  
> > if (cpuidle_not_available(drv, dev)) {
> > tick_nohz_idle_stop_tick();
> > @@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
> >  */
> > if (WARN_ON_ONCE(irqs_disabled()))
> > local_irq_enable();
> > +
> > +   generic_idle_exit();
> >  }
> 
> I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
> for you?

This patch series is written to depend on arch code as little as possible.
Since there are other architectures that may need this patchset, such as ARM,
it may be better to keep it in the generic entry code. Thoughts?

thanks,

 - Joel



Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode

2020-11-24 Thread Joel Fernandes
Hi Peter,

On Tue, Nov 24, 2020 at 05:09:06PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> 
> Oh gawd; this is horrible...

Yeah, it's the same issue we discussed earlier this year :)

> > +bool sched_core_wait_till_safe(unsigned long ti_check)
> > +{
> > +   bool restart = false;
> > +   struct rq *rq;
> > +   int cpu;
> > +
> > +   /* We clear the thread flag only at the end, so no need to check for 
> > it. */
> > +   ti_check &= ~_TIF_UNSAFE_RET;
> > +
> > +   cpu = smp_processor_id();
> > +   rq = cpu_rq(cpu);
> > +
> > +   if (!sched_core_enabled(rq))
> > +   goto ret;
> > +
> > +   /* Down grade to allow interrupts to prevent stop_machine lockups.. */
> > +   preempt_disable();
> > +   local_irq_enable();
> > +
> > +   /*
> > +* Wait till the core of this HT is not in an unsafe state.
> > +*
> > +* Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
> > +*/
> > +   while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> > +   cpu_relax();
> > +   if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> > +   restart = true;
> > +   break;
> > +   }
> > +   }
> 
> What's that ACQUIRE for?

I was concerned about something like below for weakly-ordered arch:

Kernel                                  Attacker
------                                  --------
write unsafe=1

kernel code does stores                 while (unsafe == 1); (ACQUIRE)
   ^                                       ^
   | needs to be ordered                   | needs to be ordered
   v                                       v
write unsafe=0 (RELEASE)                Attacker code.


Here, I want the accesses made by kernel code to be ordered WRT the write to
the unsafe nesting counter variable, so that the attacker code does not see
those accesses later.

It could be argued it's a theoretical concern, but I wanted to play it safe. In
particular, the existing uarch buffer flushing before entering the attacker
code might by itself make it impossible for the attacker to do anything bad,
even without the additional memory barriers.
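
To spell the pairing out, here is an illustrative sketch of the two sides
(not the actual code; the real exit path pairs via the raw_spin_lock/unlock
in sched_core_unsafe_enter/exit rather than an explicit smp_store_release):

/* Exit side: kernel leaving the unsafe region (sketch only). */
static void unsafe_exit_sketch(struct rq *rq)
{
	/*
	 * RELEASE: every access the kernel did inside the unsafe region
	 * is ordered before the counter dropping. In the real code this
	 * update happens under the rq lock.
	 */
	smp_store_release(&rq->core->core_unsafe_nest,
			  rq->core->core_unsafe_nest - 1);
}

/* Wait side: sibling about to return to usermode (sketch only). */
static void wait_side_sketch(struct rq *rq)
{
	/*
	 * ACQUIRE: nothing executed after the read that observes zero can
	 * be reordered before it, so it cannot overlap with the kernel
	 * accesses that were ordered before the RELEASE above.
	 */
	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0)
		cpu_relax();
}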

> > +
> > +   /* Upgrade it back to the expectations of entry code. */
> > +   local_irq_disable();
> > +   preempt_enable();
> > +
> > +ret:
> > +   if (!restart)
> > +   clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +   return restart;
> > +}
> 
> So if TIF_NEED_RESCHED gets set, we'll break out and reschedule, cute.

Thanks.

> > +   /* Do nothing more if the core is not tagged. */
> > +   if (!rq->core->core_cookie)
> > +   goto unlock;
> > +
> > +   for_each_cpu(i, smt_mask) {
> > +   struct rq *srq = cpu_rq(i);
> > +
> > +   if (i == cpu || cpu_is_offline(i))
> > +   continue;
> > +
> > +   if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> > +   continue;
> > +
> > +   /* Skip if HT is not running a tagged task. */
> > +   if (!srq->curr->core_cookie && !srq->core_pick)
> > +   continue;
> > +
> > +   /*
> > +* Force sibling into the kernel by IPI. If work was already
> > +* pending, no new IPIs are sent. This is Ok since the receiver
> > +* would already be in the kernel, or on its way to it.
> > +

Re: [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups

2020-11-24 Thread Joel Fernandes
Hi Peter,

On Tue, Nov 24, 2020 at 11:27:41AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:45PM -0500, Joel Fernandes (Google) wrote:
> > A previous patch improved cross-cpu vruntime comparison opertations in
> > pick_next_task(). Improve it further for tasks in CGroups.
> > 
> > In particular, for cross-CPU comparisons, we were previously going to
> > the root-level se(s) for both the task being compared. That was strange.
> > This patch instead finds the se(s) for both tasks that have the same
> > parent (which may be different from root).
> > 
> > A note about the min_vruntime snapshot and force idling:
> > Abbreviations: fi: force-idled now? ; fib: force-idled before?
> > During selection:
> > When we're not fi, we need to update snapshot.
> > when we're fi and we were not fi, we must update snapshot.
> > When we're fi and we were already fi, we must not update snapshot.
> > 
> > Which gives:
> > fib fi  update?
> > 0   0   1
> > 0   1   1
> > 1   0   1
> > 1   1   0
> > So the min_vruntime snapshot needs to be updated when: !(fib && fi).
> > 
> > Also, the cfs_prio_less() function needs to be aware of whether the core
> > is in force idle or not, since it will be use this information to know
> > whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
> > this information along via pick_task() -> prio_less().
> 
> Hurmph.. so I'm tempted to smash a bunch of patches together.
> 
>  2 <- 3 (already done - bisection crashes are daft)
>  6 <- 11
>  7 <- {10, 12}
>  9 <- 15
> 
> I'm thinking that would result in an easier to read series, or do we
> want to preserve this history?
> 
> (fwiw, I pulled 15 before 13,14, as I think that makes more sense
> anyway).

Either way would be Ok with me. I would suggest retaining the history though,
so that the details of the issues we faced are preserved in the changelogs
and we can refer back to them in the future.

thanks,

 - Joel



Re: [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case

2020-11-24 Thread Joel Fernandes
Hi Peter,

On Tue, Nov 24, 2020 at 01:04:38PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:42PM -0500, Joel Fernandes (Google) wrote:
> > +   /*
> > +* Optimize for common case where this CPU has no cookies
> > +* and there are no cookied tasks running on siblings.
> > +*/
> > +   if (!need_sync) {
> > +   for_each_class(class) {
> > +   next = class->pick_task(rq);
> > +   if (next)
> > +   break;
> > +   }
> > +
> > +   if (!next->core_cookie) {
> > +   rq->core_pick = NULL;
> > +   goto done;
> > +   }
> > need_sync = true;
> > }
> 
> This isn't what I send you here:
> 
>   
> https://lkml.kernel.org/r/20201026093131.gf2...@hirez.programming.kicks-ass.net

I had replied to it here with concerns about the effects of newly idle
balancing not being reversible; it was only a theoretical concern:
http://lore.kernel.org/r/20201105185019.ga2771...@google.com

Also I was trying to keep the logic the same as v8 for unconstrained pick
(calling pick_task), considering that has been tested quite a bit.

> Specifically, you've lost the whole cfs-cgroup optimization.

Are you referring to this optimization in pick_next_task_fair() ?

/*
 * Since we haven't yet done put_prev_entity and if the
 * selected task
 * is a different task than we started out with, try
 * and touch the
 * least amount of cfs_rqs.
 */

You are right, we wouldn't get that with just calling pick_task_fair(). We
did not have this in the v8 series either, though.

Also, if the task is a cookied task, then I think you are doing more work
with your patch due to the extra put_prev_task().

> What was wrong/not working with the below?

Other than the new idle balancing, IIRC it was also causing instability.
Maybe we can consider this optimization in the future if that's Ok with
you?

thanks,

 - Joel

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5225,8 +5227,6 @@ pick_next_task(struct rq *rq, struct tas
>   return next;
>   }
>  
> - put_prev_task_balance(rq, prev, rf);
> -
>   smt_mask = cpu_smt_mask(cpu);
>   need_sync = !!rq->core->core_cookie;
>  
> @@ -5255,17 +5255,14 @@ pick_next_task(struct rq *rq, struct tas
>* and there are no cookied tasks running on siblings.
>*/
>   if (!need_sync) {
> - for_each_class(class) {
> - next = class->pick_task(rq);
> - if (next)
> - break;
> - }
> -
> + next = __pick_next_task(rq, prev, rf);
>   if (!next->core_cookie) {
>   rq->core_pick = NULL;
> - goto done;
> + return next;
>   }
> - need_sync = true;
> + put_prev_task(next);
> + } else {
> + put_prev_task_balance(rq, prev, rf);
>   }
>  
>   for_each_cpu(i, smt_mask) {


Re: [PATCH -tip 00/32] Core scheduling (v9)

2020-11-24 Thread Joel Fernandes
On Tue, Nov 24, 2020 at 6:48 AM Vincent Guittot
 wrote:
>
> Hi Joel,
>
> On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
>  wrote:
> >
> > Core-Scheduling
> > ===
> > Enclosed is series v9 of core scheduling.
> > v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'"))..
> > I hope that this version is acceptable to be merged (pending any new review
>
> ./scripts/get_maintainer.pl is quite useful to make sure that all
> maintainers are cced and helps a lot to get some reviews

Apologies. I was just going by the folks who were CC'd on previous postings
of the series; I am new to posting this series. Sorry if I missed you; I
will run get_maintainer.pl henceforth!

 - Joel


> > comments that arise) as the main issues in the past are all resolved:
> >  1. Vruntime comparison.
> >  2. Documentation updates.
> >  3. CGroup and per-task interface developed by Google and Oracle.
> >  4. Hotplug fixes.
> > Almost all patches also have Reviewed-by or Acked-by tag. See below for full
> > list of changes in v9.
> >
> > Introduction of feature
> > ===
> > Core scheduling is a feature that allows only trusted tasks to run
> > concurrently on cpus sharing compute resources (eg: hyperthreads on a
> > core). The goal is to mitigate the core-level side-channel attacks
> > without requiring to disable SMT (which has a significant impact on
> > performance in some situations). Core scheduling (as of v7) mitigates
> > user-space to user-space attacks and user to kernel attack when one of
> > the siblings enters the kernel via interrupts or system call.
> >
> > By default, the feature doesn't change any of the current scheduler
> > behavior. The user decides which tasks can run simultaneously on the
> > same core (for now by having them in the same tagged cgroup). When a tag
> > is enabled in a cgroup and a task from that cgroup is running on a
> > hardware thread, the scheduler ensures that only idle or trusted tasks
> > run on the other sibling(s). Besides security concerns, this feature can
> > also be beneficial for RT and performance applications where we want to
> > control how tasks make use of SMT dynamically.
> >
> > Both a CGroup and Per-task interface via prctl(2) are provided for 
> > configuring
> > core sharing. More details are provided in documentation patch.  Kselftests 
> > are
> > provided to verify the correctness/rules of the interface.
> >
> > Testing
> > ===
> > ChromeOS testing shows 300% improvement in keypress latency on a Google
> > docs key press with Google hangout test (the maximum latency drops from 
> > 150ms
> > to 50ms for keypresses).
> >
> > Julien: TPCC tests showed improvements with core-scheduling as below. With 
> > kernel
> > protection enabled, it does not show any regression. Possibly ASI will 
> > improve
> > the performance for those who choose kernel protection (can be toggled 
> > through
> > sched_core_protect_kernel sysctl).
> >                                 average   stdev        diff
> > baseline (SMT on)               1197.272  44.78312824
> > core sched (   kernel protect)   412.9895 45.42734343  -65.51%
> > core sched (no kernel protect)   686.6515 71.77756931  -42.65%
> > nosmt                            408.667  39.39042872  -65.87%
> > (Note these results are from v8).
> >
> > Vineeth tested sysbench and does not see any regressions.
> > Hong and Aubrey tested v9 and see results similar to v8. There is a known 
> > issue
> > with uperf that does regress. This appears to be because of ksoftirq heavily
> > contending with other tasks on the core. The consensus is this can be 
> > improved
> > in the future.
> >
> > Changes in v9
> > =
> > - Note that the vruntime snapshot change is written in 2 patches to show the
> >   progression of the idea and prevent merge conflicts:
> > sched/fair: Snapshot the min_vruntime of CPUs on force idle
> > sched: Improve snapshotting of min_vruntime for CGroups
> >   Same with the RT priority inversion change:
> > sched: Fix priority inversion of cookied task with sibling
> > sched: Improve snapshotting of min_vruntime for CGroups
> > - Disable coresched on certain AMD HW.
> >
> > Changes in v8
> > =
> > - New interface/API implementation
> >   - Joel
> > - Revised kernel protection patch
> >   - Joel
> > - Revised Hotplug fixes
> >   - Joel
> > - Minor 

Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()

2020-11-20 Thread Joel Fernandes
On Fri, Nov 20, 2020 at 10:56:09AM +1100, Singh, Balbir wrote:
[..] 
> > +#ifdef CONFIG_SMP
> > +static struct task_struct *pick_task_fair(struct rq *rq)
> > +{
> > +   struct cfs_rq *cfs_rq = &rq->cfs;
> > +   struct sched_entity *se;
> > +
> > +   if (!cfs_rq->nr_running)
> > +   return NULL;
> > +
> > +   do {
> > +   struct sched_entity *curr = cfs_rq->curr;
> > +
> > +   se = pick_next_entity(cfs_rq, NULL);
> > +
> > +   if (curr) {
> > +   if (se && curr->on_rq)
> > +   update_curr(cfs_rq);
> > +
> > +   if (!se || entity_before(curr, se))
> > +   se = curr;
> > +   }
> 
> Do we want to optimize this a bit 
> 
> if (curr) {
>   if (!se || entity_before(curr, se))
>   se = curr;
> 
>   if ((se != curr) && curr->on_rq)
>   update_curr(cfs_rq);
> 
> }

Do you see a difference in codegen? What's the optimization?

Also in later patches this code changes, so it should not matter:
See: 3e0838fa3c51 ("sched/fair: Fix pick_task_fair crashes due to empty rbtree")

thanks,

 - Joel



Re: [PATCH -tip 01/32] sched: Wrap rq::lock access

2020-11-20 Thread Joel Fernandes
On Fri, Nov 20, 2020 at 10:31:39AM +1100, Singh, Balbir wrote:
> On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra 
> > 
> > In preparation of playing games with rq->lock, abstract the thing
> > using an accessor.
> > 
> 
> Could you clarify games? I presume the intention is to redefine the scope
> of the lock based on whether core sched is enabled or not? I presume patch
> 4/32 has the details.

Your line wrapping broke; I fixed it.

That is in fact the game. By wrapping it, which lock is actually taken becomes
dynamic, based on whether core sched is enabled or not (both statically and
dynamically).
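
Roughly, once core sched can be toggled, the accessor ends up looking
something like the below (a sketch from memory, not the literal patch; the
exact field names may differ):

static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
	if (sched_core_enabled(rq))
		return &rq->core->__lock;	/* one lock shared by the whole core */

	return &rq->__lock;			/* the usual per-runqueue lock */
}

Callers then do raw_spin_lock(rq_lockp(rq)) instead of touching rq->lock
directly, so the scope of the lock can be switched without changing them.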

thanks,

 - Joel

 
> Balbir Singh


Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-11-19 Thread Joel Fernandes
On Thu, Nov 19, 2020 at 12:16:15PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 19, 2020 at 02:44:35PM -0500, Joel Fernandes wrote:
> > On Thu, Nov 19, 2020 at 2:22 PM Paul E. McKenney  wrote:
> > > > > > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes (Google) 
> > > > > > > wrote:
> > > > > > > > After rcu_do_batch(), add a check for whether the seglen counts 
> > > > > > > > went to
> > > > > > > > zero if the list was indeed empty.
> > > > > > > >
> > > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > >
> > > > > > > Queued for testing and further review, thank you!
> > > > > >
> > > > > > FYI, the second of the two checks triggered in all four one-hour 
> > > > > > runs of
> > > > > > TREE01, all four one-hour runs of TREE04, and one of the four 
> > > > > > one-hour
> > > > > > runs of TREE07.  This one:
> > > > > >
> > > > > > WARN_ON_ONCE(count != 0 && 
> > > > > > rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > > > > >
> > > > > > That is, there are callbacks in the list, but the sum of the segment
> > > > > > counts is nevertheless zero.  The ->nocb_lock is held.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > FWIW, TREE01 reproduces it very quickly compared to the other two
> > > > > scenarios, on all four run, within five minutes.
> > > >
> > > > So far for TREE01, I traced it down to an rcu_barrier happening so it 
> > > > could
> > > > be related to some interaction with rcu_barrier() (Just a guess).
> > >
> > > Well, rcu_barrier() and srcu_barrier() are the only users of
> > > rcu_segcblist_entrain(), if that helps.  Your modification to that
> > > function looks plausible to me, but the system's opinion always overrules
> > > mine.  ;-)
> > 
> > Right. Does anything the bypass code standout? That happens during
> > rcu_barrier() as well, and it messes with the lengths.
> 
> In theory, rcu_barrier_func() flushes the bypass before doing the
> entrain, and does the rcu_segcblist_entrain() afterwards.
> 
> Ah, and that is the issue.  If ->cblist is empty and ->nocb_bypass
> is not, then ->cblist length will be nonzero, and none of the
> segments will be nonzero.
> 
> So you need something like this for that second WARN, correct?
> 
>   WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
>rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> 
> This is off the cuff, so should be taken with a grain of salt.  And
> there might well be other similar issues.

Ah, makes sense. Or maybe it should be made like the other warning?
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) && count != 0
     && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);

Though your warning is better.

I will try these out and see if it goes away. I am afraid though that there
is an issue with !NOCB code since you had other configs that were failing
similarly.. :-\.

thanks, :-)

 - Joel



Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-11-19 Thread Joel Fernandes
On Thu, Nov 19, 2020 at 2:22 PM Paul E. McKenney  wrote:
> > > > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes (Google) 
> > > > > wrote:
> > > > > > After rcu_do_batch(), add a check for whether the seglen counts 
> > > > > > went to
> > > > > > zero if the list was indeed empty.
> > > > > >
> > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > >
> > > > > Queued for testing and further review, thank you!
> > > >
> > > > FYI, the second of the two checks triggered in all four one-hour runs of
> > > > TREE01, all four one-hour runs of TREE04, and one of the four one-hour
> > > > runs of TREE07.  This one:
> > > >
> > > > WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 
> > > > 0);
> > > >
> > > > That is, there are callbacks in the list, but the sum of the segment
> > > > counts is nevertheless zero.  The ->nocb_lock is held.
> > > >
> > > > Thoughts?
> > >
> > > FWIW, TREE01 reproduces it very quickly compared to the other two
> > > scenarios, on all four run, within five minutes.
> >
> > So far for TREE01, I traced it down to an rcu_barrier happening so it could
> > be related to some interaction with rcu_barrier() (Just a guess).
>
> Well, rcu_barrier() and srcu_barrier() are the only users of
> rcu_segcblist_entrain(), if that helps.  Your modification to that
> function looks plausible to me, but the system's opinion always overrules
> mine.  ;-)

Right. Does anything in the bypass code stand out? That happens during
rcu_barrier() as well, and it messes with the lengths.

> > 'p2' print below is the panic on the second warning. It executes 43 
> > callbacks
> > from the segcb list, but the list length still does not appear to be 0.
> >
> > I'll debug it more:
> >
> > [  191.085702]  rcuop/0-120 75782125us : rcu_invoke_callback: 
> > rcu_preempt rhp=6a881152 func=__d_free
> > [  191.844028]  rcuop/0-120d..1 75796122us : rcu_segcb_stats: 
> > SegCbDequeued seglen: (DONE=0, WAIT=43, NEXT_READY=0, NEXT=0) gp_seq: 
> > (DONE=0, WAIT=11656, NEXT_READY=11656, NEXT=0)
> > [  191.846493]  rcuop/0-120 75796122us : rcu_invoke_callback: 
> > rcu_preempt rhp=017536a2 func=i_callback
> > [  191.848160]  rcuop/0-120 75796123us : rcu_invoke_callback: 
> > rcu_preempt rhp=14235c71 func=__d_free
> > [  191.849695]  rcuop/0-120 75796123us : rcu_invoke_callback: 
> > rcu_preempt rhp=368c5928 func=i_callback
> > [  191.851262]  rcuop/0-120 75796124us : rcu_invoke_callback: 
> > rcu_preempt rhp=bdbea790 func=__d_free
> > [  191.852800]  rcuop/0-120 75796124us : rcu_invoke_callback: 
> > rcu_preempt rhp=34b99f3d func=rcu_barrier_callback
> > [  192.526784]  rcuop/0-120d..1 75809162us : rcu_segcb_stats: 
> > SegCbDequeued seglen: (DONE=0, WAIT=0, NEXT_READY=0, NEXT=0) gp_seq: 
> > (DONE=0, WAIT=11656, NEXT_READY=11656, NEXT=0)
>
> Quite the coincidence that WAIT and NEXT_READY have exactly the same
> number of callbacks, isn't it?  Or am I being too suspicious today?

Those numbers are gp_seq :-D.

thanks,

 - Joel



>
> Thanx, Paul
>
> > [  192.529132]  rcuop/0-120 75809163us : rcu_invoke_callback: 
> > rcu_preempt rhp=2d6a3fce func=rcu_sync_func
> > [  192.530807]  rcuop/0-120 75809165us : rcu_invoke_callback: 
> > rcu_preempt rhp=9aa91c97 func=destroy_sched_domains_rcu
> > [  192.532556]  rcuop/0-120 75809170us : rcu_invoke_callback: 
> > rcu_preempt rhp=2bb5a998 func=destroy_sched_domains_rcu
> > [  192.534303]  rcuop/0-120 75809172us : rcu_invoke_callback: 
> > rcu_preempt rhp=bcc2369a func=destroy_sched_domains_rcu
> > [  192.536053]  rcuop/0-120 75809174us : rcu_invoke_callback: 
> > rcu_preempt rhp=4dcec39b func=destroy_sched_domains_rcu
> > [  192.537802]  rcuop/0-120 75809176us : rcu_invoke_callback: 
> > rcu_preempt rhp=dedb509d func=destroy_sched_domains_rcu
> > [  192.539553]  rcuop/0-120 75809178us : rcu_invoke_callback: 
> > rcu_preempt rhp=6fe7dd9e func=destroy_sched_domains_rcu
> > [  192.541299]  rcuop/0-120 75809180us : rcu_invoke_callback: 
> > rcu_preempt rhp=5a212061 func=destroy_sched_domains_rcu
&

Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-11-19 Thread Joel Fernandes
On Wed, Nov 18, 2020 at 07:56:13PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 18, 2020 at 07:52:23PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 18, 2020 at 12:13:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes (Google) wrote:
> > > > After rcu_do_batch(), add a check for whether the seglen counts went to
> > > > zero if the list was indeed empty.
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) 
> > > 
> > > Queued for testing and further review, thank you!
> > 
> > FYI, the second of the two checks triggered in all four one-hour runs of
> > TREE01, all four one-hour runs of TREE04, and one of the four one-hour
> > runs of TREE07.  This one:
> > 
> > WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > 
> > That is, there are callbacks in the list, but the sum of the segment
> > counts is nevertheless zero.  The ->nocb_lock is held.
> > 
> > Thoughts?
> 
> FWIW, TREE01 reproduces it very quickly compared to the other two
> scenarios, on all four run, within five minutes.

So far for TREE01, I traced it down to an rcu_barrier happening so it could
be related to some interaction with rcu_barrier() (Just a guess).

'p2' print below is the panic on the second warning. It executes 43 callbacks
from the segcb list, but the list length still does not appear to be 0.

I'll debug it more:

[  191.085702]  rcuop/0-120 75782125us : rcu_invoke_callback: 
rcu_preempt rhp=6a881152 func=__d_free
[  191.844028]  rcuop/0-120d..1 75796122us : rcu_segcb_stats: 
SegCbDequeued seglen: (DONE=0, WAIT=43, NEXT_READY=0, NEXT=0) gp_seq: (DONE=0, 
WAIT=11656, NEXT_READY=11656, NEXT=0)
[  191.846493]  rcuop/0-120 75796122us : rcu_invoke_callback: 
rcu_preempt rhp=017536a2 func=i_callback
[  191.848160]  rcuop/0-120 75796123us : rcu_invoke_callback: 
rcu_preempt rhp=14235c71 func=__d_free
[  191.849695]  rcuop/0-120 75796123us : rcu_invoke_callback: 
rcu_preempt rhp=368c5928 func=i_callback
[  191.851262]  rcuop/0-120 75796124us : rcu_invoke_callback: 
rcu_preempt rhp=bdbea790 func=__d_free
[  191.852800]  rcuop/0-120 75796124us : rcu_invoke_callback: 
rcu_preempt rhp=34b99f3d func=rcu_barrier_callback
[  192.526784]  rcuop/0-120d..1 75809162us : rcu_segcb_stats: 
SegCbDequeued seglen: (DONE=0, WAIT=0, NEXT_READY=0, NEXT=0) gp_seq: (DONE=0, 
WAIT=11656, NEXT_READY=11656, NEXT=0)
[  192.529132]  rcuop/0-120 75809163us : rcu_invoke_callback: 
rcu_preempt rhp=2d6a3fce func=rcu_sync_func
[  192.530807]  rcuop/0-120 75809165us : rcu_invoke_callback: 
rcu_preempt rhp=9aa91c97 func=destroy_sched_domains_rcu
[  192.532556]  rcuop/0-120 75809170us : rcu_invoke_callback: 
rcu_preempt rhp=2bb5a998 func=destroy_sched_domains_rcu
[  192.534303]  rcuop/0-120 75809172us : rcu_invoke_callback: 
rcu_preempt rhp=bcc2369a func=destroy_sched_domains_rcu
[  192.536053]  rcuop/0-120 75809174us : rcu_invoke_callback: 
rcu_preempt rhp=4dcec39b func=destroy_sched_domains_rcu
[  192.537802]  rcuop/0-120 75809176us : rcu_invoke_callback: 
rcu_preempt rhp=dedb509d func=destroy_sched_domains_rcu
[  192.539553]  rcuop/0-120 75809178us : rcu_invoke_callback: 
rcu_preempt rhp=6fe7dd9e func=destroy_sched_domains_rcu
[  192.541299]  rcuop/0-120 75809180us : rcu_invoke_callback: 
rcu_preempt rhp=5a212061 func=destroy_sched_domains_rcu
[  192.543043]  rcuop/0-120 75809182us : rcu_invoke_callback: 
rcu_preempt rhp=c914935f func=destroy_sched_domains_rcu
[  192.544792]  rcuop/0-120 75809184us : rcu_invoke_callback: 
rcu_preempt rhp=19fa3368 func=destroy_sched_domains_rcu
[  192.546539]  rcuop/0-120 75809186us : rcu_invoke_callback: 
rcu_preempt rhp=ab06c069 func=destroy_sched_domains_rcu
[  192.548289]  rcuop/0-120 75809188us : rcu_invoke_callback: 
rcu_preempt rhp=3c134d6b func=destroy_sched_domains_rcu
[  192.550037]  rcuop/0-120 75809190us : rcu_invoke_callback: 
rcu_preempt rhp=cd1fda6c func=destroy_sched_domains_rcu
[  192.551790]  rcuop/0-120 75809192us : rcu_invoke_callback: 
rcu_preempt rhp=5e2c676e func=destroy_sched_domains_rcu
[  192.553576]  rcuop/0-120 75809194us : rcu_invoke_callback: 
rcu_preempt rhp=ef38f46f func=destroy_sched_domains_rcu
[  192.555314]  rcuop/0-120 75809196us : rcu_invoke_callback: 
rcu_preempt rhp=80458170 func=destroy_sched_domains_rcu
[  192.557054]  rcuop/0-120 75809198us : rcu_invoke_callback: 
rcu

[PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-11-18 Thread Joel Fernandes (Google)
After rcu_do_batch(), add a check for whether the seglen counts went to
zero if the list was indeed empty.

Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: Added more debug checks.

 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 5059b6102afe..6e98bb3804f0 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..46a42d77f7e1 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5b61e10f1de..91e35b521e51 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2553,6 +2553,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 count != 0 && rcu_segcblist_empty(&rdp->cblist));
+   WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
+   WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 
-- 
2.29.2.299.gdc1121823c-goog


[PATCH] rcu/segcblist: Add debug check for whether seglen is 0 for empty list

2020-11-18 Thread Joel Fernandes (Google)
After rcu_do_batch(), add a check for whether the seglen counts went to
zero if the list was indeed empty.

Signed-off-by: Joel Fernandes (Google) 

---
 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  1 +
 3 files changed, 16 insertions(+)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 5059b6102afe..6e98bb3804f0 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..46a42d77f7e1 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5b61e10f1de..928bd10c9c3b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2553,6 +2553,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 count != 0 && rcu_segcblist_empty(&rdp->cblist));
+   WARN_ON_ONCE(count == 0 && !rcu_segcblist_n_segment_cbs(&rdp->cblist));
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle

2020-11-17 Thread Joel Fernandes (Google)
During force-idle, we end up doing cross-cpu comparison of vruntimes
during pick_next_task. If we simply compare (vruntime-min_vruntime)
across CPUs, and if the CPUs only have 1 task each, we will always
end up comparing 0 with 0 and pick just one of the tasks all the time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.

This resolves several performance issues that were seen in ChromeOS
audio usecase.

NOTE: Note, this patch will be improved in a later patch. It is just
  kept here as the basis for the later patch and to make rebasing
  easier. Further, it may make reverting the improvement easier in
  case the improvement causes any regression.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 33 -
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  5 +
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52d0e83072a4..4ee4902c2cf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-   u64 vruntime = b->se.vruntime;
-
-   /*
-* Normalize the vruntime if tasks are in different cpus.
-*/
-   if (task_cpu(a) != task_cpu(b)) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
-   }
+   if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
+   return cfs_prio_less(a, b);
 
return false;
 }
@@ -5144,6 +5133,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
+   bool fi_before = false;
bool need_sync;
int i, j, cpu;
 
@@ -5208,6 +5198,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
+   fi_before = true;
rq->core->core_forceidle = false;
}
for_each_cpu(i, smt_mask) {
@@ -5219,6 +5210,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
update_rq_clock(rq_i);
}
 
+   /* Reset the snapshot if core is no longer in force-idle. */
+   if (!fi_before) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
/*
 * Try and select tasks for each sibling in decending sched_class
 * order.
@@ -5355,6 +5354,14 @@ next_class:;
resched_curr(rq_i);
}
 
+   /* Snapshot if core is in force-idle. */
+   if (!fi_before && rq->core->core_forceidle) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
 done:
set_next_task(rq, next);
return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42965c4fd71f..de82f88ba98c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10726,6 +10726,46 @@ static inline void task_tick_core(struct rq *rq, 
struct task_struct *curr)
__entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   bool samecpu = task_cpu(a) == task_cpu(b);
+   struct sched_entity *sea = &a->se;
+   struct sched_entity *seb = &b->se;
+   struct cfs_rq *cfs_rqa;
+   struct cfs_rq *cfs_rqb;
+   s64 delta;
+
+   if (samecpu) {
+   /* vruntime is per cfs_rq */
+   while (!is_same_group(sea, seb)) {
+   int sea_depth = sea->depth;
+   int seb_depth = seb->depth;
+   if (sea_depth >= seb_depth)
+   sea = parent_entity(sea);
+   if (sea_depth <= seb_depth)
+   seb = parent_entity

[PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case

2020-11-17 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

If there is only one long-running local task and the sibling is
forced idle, the forced-idle sibling might not get a chance to run
until a schedule event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 15 ---
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bd0b0bbb040..52d0e83072a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* reset state */
rq->core->core_cookie = 0UL;
+   if (rq->core->core_forceidle) {
+   need_sync = true;
+   rq->core->core_forceidle = false;
+   }
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
rq_i->core_pick = NULL;
 
-   if (rq_i->core_forceidle) {
-   need_sync = true;
-   rq_i->core_forceidle = false;
-   }
-
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -5335,8 +5334,10 @@ next_class:;
if (!rq_i->core_pick)
continue;
 
-   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-   rq_i->core_forceidle = true;
+   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+   !rq_i->core->core_forceidle) {
+   rq_i->core->core_forceidle = true;
+   }
 
if (i == cpu) {
rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f53681cd263e..42965c4fd71f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+   u64 slice = sched_slice(cfs_rq_of(se), se);
+   u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+   return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE  2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+   if (!sched_core_enabled(rq))
+   return;
+
+   /*
+* If runqueue has only one task which used up its slice and
+* if the sibling is forced idle, then trigger schedule to
+* give forced idle task a chance.
+*
+* sched_slice() considers only this active rq and it gets the
+* whole slice. But during force idle, we have siblings acting
+* like a single runqueue and hence we need to consider runnable
+* tasks on this cpu and the forced idle cpu. Ideally, we should
+* go through the forced idle rq, but that would be a perf hit.
+* We can assume that the forced idle cpu has atleast
+* MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+* if we need to give up the cpu.
+*/
+   if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+   __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+   resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10715,6 +10753,8 @@ static void task_tick_fair(struct rq *rq, struct 
task_struct *curr, int queued)
 
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+   task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63b28e1843ee..be656ca8693d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,12 +1069,12 @@ struct rq {
unsigned intcore_enabled;
unsigned intcore_sched_seq;
struct rb_root  core_tree;
-   unsigned char   core_forceidle;
 
/* shared state */
unsigned intcore_task_seq;
unsigned intcore_pick_seq;
unsigned long   core_cookie;
+   unsigned char   core_forceidle;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also after we forked a task, its core scheduler queue's presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing the update while we
iterate through all the tasks in the cgroup.  A more complicated scheme
could probably avoid the stop machine.  Such a scheme would also need to
resolve inconsistencies between a task's cgroup core scheduling tag and its
residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 183 +--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..b99a7493d590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+   return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   struct task_struct *task;
+
+   while (!sched_core_empty(rq)) {
+   task = sched_core_first(rq);
+   rb_erase(&task->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&task->core_node);
+   }
+   rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct 
task_struct *p)
 {
rq->core->core_task_seq++;
 
-   if (!p->core_cookie)
+   if (!sched_core_enqueued(p))
return;
 
rb_erase(&p->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
-   cpu_rq(cpu)->core_enabled = enabled;
+   for_each_possible_cpu(cpu) {
+   struct rq *rq = cpu_rq(cpu);
+
+   WARN_ON_ONCE(enabled == rq->core_enabled);
+
+   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) 
>= 2)) {
+   /*
+* All active and migrating tasks will have already
+* been removed from core queue when we clear the
+* cgroup tags. However, dying tasks could still be
+* left in core queue. Flush them here.
+*/
+   if (!enabled)
+   sched_core_flush(cpu);
+
+   rq->core_enabled = enabled;
+   }
+   }
 
return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-   // XXX verify there are no cookie tasks (yet)
+   int cpu;
+
+   /* verify there are no cookie tasks (yet) */
+   for_each_online_cpu(cpu)
+   BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-   // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3978,6 +4029,9 @@ int sched_fork

[PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups

2020-11-17 Thread Joel Fernandes (Google)
A previous patch improved cross-cpu vruntime comparison operations in
pick_next_task(). Improve it further for tasks in CGroups.

In particular, for cross-CPU comparisons, we were previously going to
the root-level se(s) for both of the tasks being compared. That was strange.
This patch instead finds the se(s) for both tasks that have the same
parent (which may be different from root).

A note about the min_vruntime snapshot and force idling:
Abbreviations: fi: force-idled now? ; fib: force-idled before?
During selection:
When we're not fi, we need to update snapshot.
when we're fi and we were not fi, we must update snapshot.
When we're fi and we were already fi, we must not update snapshot.

Which gives:
fib fi  update?
0   0   1
0   1   1
1   0   1
1   1   0
So the min_vruntime snapshot needs to be updated when: !(fib && fi).
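
A minimal sketch of that rule, with illustrative names (the actual patch
encodes it via a force-idle sequence counter, see se_fi_update() below,
rather than two booleans):

	/*
	 * Update the min_vruntime snapshot unless the core was force-idled
	 * both before and after this pick.
	 */
	static inline bool need_vruntime_snapshot(bool fi_before, bool fi_now)
	{
		return !(fi_before && fi_now);
	}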

Also, the cfs_prio_less() function needs to be aware of whether the core
is in force idle or not, since it will use this information to know
whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
this information along via pick_task() -> prio_less().

Reviewed-by: Vineeth Pillai 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 61 +
 kernel/sched/fair.c  | 80 
 kernel/sched/sched.h |  7 +++-
 3 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b373b592680..20125431af87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -101,7 +101,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, 
bool in_fi)
 {
 
int pa = __task_prio(a), pb = __task_prio(b);
@@ -116,7 +116,7 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
-   return cfs_prio_less(a, b);
+   return cfs_prio_less(a, b, in_fi);
 
return false;
 }
@@ -130,7 +130,7 @@ static inline bool __sched_core_less(struct task_struct *a, 
struct task_struct *
return false;
 
/* flip prio, so high prio is leftmost */
-   if (prio_less(b, a))
+   if (prio_less(b, a, task_rq(a)->core->core_forceidle))
return true;
 
return false;
@@ -5101,7 +5101,7 @@ static inline bool cookie_match(struct task_struct *a, 
struct task_struct *b)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max, bool in_fi)
 {
struct task_struct *class_pick, *cookie_pick;
unsigned long cookie = rq->core->core_cookie;
@@ -5116,7 +5116,7 @@ pick_task(struct rq *rq, const struct sched_class *class, 
struct task_struct *ma
 * higher priority than max.
 */
if (max && class_pick->core_cookie &&
-   prio_less(class_pick, max))
+   prio_less(class_pick, max, in_fi))
return idle_sched_class.pick_task(rq);
 
return class_pick;
@@ -5135,13 +5135,15 @@ pick_task(struct rq *rq, const struct sched_class 
*class, struct task_struct *ma
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
 */
-   if (prio_less(cookie_pick, class_pick) &&
-   (!max || prio_less(max, class_pick)))
+   if (prio_less(cookie_pick, class_pick, in_fi) &&
+   (!max || prio_less(max, class_pick, in_fi)))
return class_pick;
 
return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool 
in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5230,9 +5232,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
if (!next->core_cookie) {
rq->core_pick = NULL;
+   /*
+* For robustness, update the min_vruntime_fi for
+* unconstrained picks as well.
+*/
+   WARN_ON_ONCE(fi_before);
+   task_vruntime_update(rq, next, false);
goto done;
}
-   need_sync = true;
}
 
for_each_cpu(i, s

[PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit

2020-11-17 Thread Joel Fernandes (Google)
Add a generic_idle_{enter,exit} helper function to enter and exit kernel
protection when entering and exiting idle, respectively.

While at it, remove a stale RCU comment.

Reviewed-by: Alexandre Chartre 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/entry-common.h | 18 ++
 kernel/sched/idle.c  | 11 ++-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 022e1f114157..8f34ae625f83 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
&& _TIF_UNSAFE_RET != 0;
 }
+
+/**
+ * generic_idle_enter - General tasks to perform during idle entry.
+ */
+static inline void generic_idle_enter(void)
+{
+   /* Entering idle ends the protected kernel region. */
+   sched_core_unsafe_exit();
+}
+
+/**
+ * generic_idle_exit  - General tasks to perform during idle exit.
+ */
+static inline void generic_idle_exit(void)
+{
+   /* Exiting idle (re)starts the protected kernel region. */
+   sched_core_unsafe_enter();
+}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8bdb214eb78f..ee4f91396c31 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
  */
 #include "sched.h"
 
+#include 
 #include 
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
+   generic_idle_enter();
trace_cpu_idle(0, smp_processor_id());
stop_critical_timings();
rcu_idle_enter();
@@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
rcu_idle_exit();
start_critical_timings();
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+   generic_idle_exit();
 
return 1;
 }
@@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
return;
}
 
-   /*
-* The RCU framework needs to be told that we are entering an idle
-* section, so no more rcu read side critical sections and one more
-* step to the grace period
-*/
+   generic_idle_enter();
 
if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();
@@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
 */
if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();
+
+   generic_idle_exit();
 }
 
 /*
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 32/32] sched: Debug bits...

2020-11-17 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 35 ++-
 kernel/sched/fair.c |  9 +
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01938a2154fd..bbeeb18d460e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 DEFINE_STATIC_KEY_TRUE(sched_coresched_supported);
@@ -5486,6 +5494,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5580,6 +5595,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5596,6 +5614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5617,6 +5637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5658,13 +5679,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5704,6 +5733,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a89c7c917cc6..81c8a50ab4c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10798,6 +10798,15 @@ static void se_fi_update(struct sched_entity *se, 
unsigned int fi_seq, 

[PATCH -tip 31/32] sched: Add a coresched command line option

2020-11-17 Thread Joel Fernandes (Google)
Some hardware, such as certain AMD variants, doesn't have cross-HT MDS/L1TF
issues. Detect this and don't enable core scheduling there, as it can
needlessly slow those devices down.

However, some users may want core scheduling even if the hardware is
secure. To support them, add a coresched= option which defaults to
'secure' and can be overridden to 'on' if the user wants to enable
coresched even if the HW is not vulnerable. 'off' would disable
core scheduling in any case.

Also add a sched_debug entry to indicate if core scheduling is turned on
or not.

Reviewed-by: Alexander Graf 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt | 14 ++
 arch/x86/kernel/cpu/bugs.c| 19 
 include/linux/cpu.h   |  1 +
 include/linux/sched/smt.h |  4 ++
 kernel/cpu.c  | 43 +++
 kernel/sched/core.c   |  6 +++
 kernel/sched/debug.c  |  4 ++
 7 files changed, 91 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b185c6ed4aba..9cd2cf7c18d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -698,6 +698,20 @@
/proc//coredump_filter.
See also Documentation/filesystems/proc.rst.
 
+   coresched=  [SCHED_CORE] This feature allows the Linux scheduler
+   to force hyperthread siblings of a CPU to only execute tasks
+   concurrently on all hyperthreads that are running within the
+   same core scheduling group.
+   Possible values are:
+   'on' - Enable scheduler capability to core schedule.
+   By default, no tasks will be core scheduled, but the coresched
+   interface can be used to form groups of tasks that are forced
+   to share a core.
+   'off' - Disable scheduler capability to core schedule.
+   'secure' - Like 'on' but only enable on systems affected by
+   MDS or L1TF vulnerabilities. 'off' otherwise.
+   Default: 'secure'.
+
coresight_cpu_debug.enable
[ARM,ARM64]
Format: 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index dece79e4d1e9..f3163f4a805c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -43,6 +43,7 @@ static void __init mds_select_mitigation(void);
 static void __init mds_print_mitigation(void);
 static void __init taa_select_mitigation(void);
 static void __init srbds_select_mitigation(void);
+static void __init coresched_select(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -103,6 +104,9 @@ void __init check_bugs(void)
if (boot_cpu_has(X86_FEATURE_STIBP))
x86_spec_ctrl_mask |= SPEC_CTRL_STIBP;
 
+   /* Update whether core-scheduling is needed. */
+   coresched_select();
+
/* Select the proper CPU mitigations before patching alternatives: */
spectre_v1_select_mitigation();
spectre_v2_select_mitigation();
@@ -1808,4 +1812,19 @@ ssize_t cpu_show_srbds(struct device *dev, struct 
device_attribute *attr, char *
 {
return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS);
 }
+
+/*
+ * When coresched=secure command line option is passed (default), disable core
+ * scheduling if CPU does not have MDS/L1TF vulnerability.
+ */
+static void __init coresched_select(void)
+{
+#ifdef CONFIG_SCHED_CORE
+   if (coresched_cmd_secure() &&
+   !boot_cpu_has_bug(X86_BUG_MDS) &&
+   !boot_cpu_has_bug(X86_BUG_L1TF))
+   static_branch_disable(&sched_coresched_supported);
+#endif
+}
+
 #endif
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d6428aaf67e7..d1f1e64316d6 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -228,4 +228,5 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control 
ctrlval) { return 0;
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
+extern bool coresched_cmd_secure(void);
 #endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/sched/smt.h b/include/linux/sched/smt.h
index 59d3736c454c..561064eb3268 100644
--- a/include/linux/sched/smt.h
+++ b/include/linux/sched/smt.h
@@ -17,4 +17,8 @@ static inline bool sched_smt_active(void) { return false; }
 
 void arch_smt_update(void);
 
+#ifdef CONFIG_SCHED_CORE
+extern struct static_key_true sched_coresched_supported;
+#endif
+
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa535eaa4826..f22330c3ab4c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2559,3 +2559,46 @@ bool cpu_mitiga

[PATCH -tip 30/32] Documentation: Add core scheduling documentation

2020-11-17 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 330 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 331 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..01be28d0687a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,330 @@
+Core Scheduling
+***
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that only trusted tasks can share a core. This increase in core
+sharing can improve performance; however, it is not guaranteed that performance
+will always improve, though that has been observed with a number of real-world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+##
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This results in all of the CGroup's tasks being allowed to run concurrently on
+a core's hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set
+  this for any descendant of the tagged group. For finer grained control,
+  the ``cpu.core_tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this reason,
+  if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+* ``cpu.core_tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.core_tag`` writable only by root and the
+``cpu.core_tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+ / \
+A   B(These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the
+``cpu.core_tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.core_tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants
+to allow a subset of child CGroups within a tagged parent CGroup to be
+co-scheduled on a core while not being co-scheduled with other child CGroups.
+Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not trac

[PATCH -tip 13/32] sched: Trivial forced-newidle balancer

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Acked-by: Paul E. McKenney 
Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344499ab29f2..7efce9c9d9cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12e8e6627ab3..3b373b592680 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -5134,8 +5149,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
const struct sched_class *class;
const struct cpumask *smt_mask;
bool fi_before = false;
+   int i, j, cpu, occ = 0;
bool need_sync;
-   int i, j, cpu;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -5260,6 +5275,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
if (!p)
continue;
 
+   if (!is_task_rq_idle(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -5285,6 +5303,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
}
}
@@ -5324,6 +5343,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq_i->core->core_forceidle = true;
}
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu) {
rq_i->core_pick = NULL;
continue;
@@ -5353,6 +5374,113 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_mask))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+  

[PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file

2020-11-17 Thread Joel Fernandes (Google)
core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Also make the following changes:
- Fix SWA bugs found by Chris Hyser.
- Fix refcount underrun caused by not zeroing the new task's cookie.

Tested-by: Julien Desfossez 
Reviewed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 809 +---
 kernel/sched/coretag.c | 819 +
 kernel/sched/sched.h   |  51 ++-
 4 files changed, 872 insertions(+), 808 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1d9762b571a..5ef04bdc849f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
	return RB_EMPTY_ROOT(&rq->core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-   return !RB_EMPTY_NODE(&task->core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct 
task_struct *p)
rb_insert_color(>core_node, >core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
__sched_core_disable();
	mutex_unlock(&sched_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -3834,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, 
struct task_struct *p)
p->capture_control = NULL;
 #endif
init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+   p->core_task_cookie = 0;
+#endif
 #ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
@@ -9118,11 +9105,6 @@ void sched_move_task(struct task_struct *tsk)
	task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9735,787 +9717,6 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * Wrapper representing a complete cookie. The address of the cookie is used as
- * a unique identifier. Each cookie has a unique permutation of the internal
- * cookie fields.
- */
-struct sched_core_cookie {
-   unsigned long task_cookie;
-   unsigned long group_cookie;
-   unsigned long color;
-
-   struct rb_node node;
-   refcount_t refcnt;
-};
-
-/*
- * A simple wrapper around refcount. An allocated sched_core_task_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_task_cookie {
-   refcount_t refcnt;
-};
-
-/* All active sched_core_cookies */
-static struct rb_root sched_core_cookies = RB_ROOT;
-static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
-
-/*
- * Returns the following:
- * a < b  => -1
- * a == b => 0
- * a > b  => 1
- */
-static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
-const struct sched_core_cookie *b)
-{
-#define COOKIE_CMP_RETURN(fie

[PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

pick_next_entity() is passed curr == NULL during core-scheduling. Due to
this, if the rbtree is empty, the 'left' variable is set to NULL within
the function. This can cause crashes within the function.

This is not an issue if put_prev_task() is invoked on the currently
running task before calling pick_next_entity(). However, in core
scheduling, it is possible that a sibling CPU picks for another RQ in
the core, via pick_task_fair(). This remote sibling would not get any
opportunities to do a put_prev_task().

Fix it by refactoring pick_task_fair() such that pick_next_entity() is
called with the cfs_rq->curr. This will prevent pick_next_entity() from
crashing if its rbtree is empty.

Also, this fixes another possible bug where update_curr() would not be
called on the cfs_rq hierarchy if the rbtree is empty. This could affect
cross-CPU comparison of vruntime.

Suggested-by: Vineeth Remanan Pillai 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12cf068eeec8..51483a00a755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7029,15 +7029,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
do {
struct sched_entity *curr = cfs_rq->curr;
 
-   se = pick_next_entity(cfs_rq, NULL);
-
-   if (curr) {
-   if (se && curr->on_rq)
-   update_curr(cfs_rq);
+   if (curr && curr->on_rq)
+   update_curr(cfs_rq);
 
-   if (!se || entity_before(curr, se))
-   se = curr;
-   }
+   se = pick_next_entity(cfs_rq, curr);
 
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years in which a second tag is
added that is writable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root, while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B(These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a tagged parent cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these subcgroups are
created by AppEngine, they are not tracked by borglet (the root daemon),
therefore borglet won't have a chance to set a color for them. That's where
'color' file comes from. Color could be set by AppEngine, and once set, the
normal tasks within the subcgroup would not be able to overwrite it. This is
enforced by promoting the permission of the color file in cgroupfs.
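
A userspace sketch of the resulting two-level flow (the file names are the
ones added by this series; the cgroup mount point and group names are
assumptions for the example):

	#include <stdio.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s", val);
		return fclose(f);
	}

	int main(void)
	{
		/* Root daemon (borglet): tag container A. */
		write_str("/sys/fs/cgroup/cpu/A/cpu.core_tag", "1");
		/* AppEngine (unprivileged via DAC): color child C inside A. */
		write_str("/sys/fs/cgroup/cpu/A/C/cpu.core_tag_color", "7");
		return 0;
	}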

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++---
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
unsigned long   core_cookie;
unsigned long   core_task_cookie;
unsigned long   core_group_cookie;
+   unsigned long   core_color;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 struct sched_core_cookie {
unsigned long task_cookie;
unsigned long group_cookie;
+   unsigned long color;
 
struct rb_node node;
refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct 
sched_core_cookie *a,
 
COOKIE_CMP_RETURN(task_cookie);
COOKIE_CMP_RETURN(group_cookie);
+   COOKIE_CMP_RETURN(color);
 
/* all cookie fields match */
return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct 
sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
struct sched_core_cookie temp = {
.task_cookie= p->core_task_cookie,
.group_cookie   = p->core_group_cookie,
+   .color  = p->core_color
};
const bool is_zero_cookie =
		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
 
match->task_cookie = temp.task_cookie;
match->group_cookie = temp.group_cookie;
+   match->color = temp.color;
	refcount_set(&match->refcnt, 1);
 
	rb_link_node(&match->node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
case sched_core_group_cookie_type:
p->core_group_cookie = cookie;
break;
+   case sched_core_color_type:
+   p->core_color = cookie;
+   break;
default:
WARN_ON_ONCE(1);
}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
+void cpu_c

[PATCH -tip 28/32] kselftest: Add tests for core-sched interface

2020-11-17 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Tested-by: Julien Desfossez 
Reviewed-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 818 ++
 4 files changed, 834 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..70ed2758fe23
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,818 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+}
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+   sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+char path[50] = {}, *val;
+int fd;
+
+sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+fd = open(path, O_RDONLY, 0666);
+if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+val = calloc(1, 50);
+if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+}
+
+val[strcspn(val, "\r\n")] = 0;
+
+close(fd);
+return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+}
+
+if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+   abort();
+}
+
+if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+}
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+tfd = open

[PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG

2020-11-17 Thread Joel Fernandes (Google)
This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.

Reviewed-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f17ec8e993e..f1d9762b571a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10277,6 +10277,21 @@ static u64 cpu_core_tag_color_read_u64(struct 
cgroup_subsys_state *css, struct c
return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, 
struct cftype *cft)
+{
+   unsigned long group_cookie, color;
+
+   cpu_core_get_group_cookie_and_color(css_tg(css), &group_cookie, &color);
+
+   /*
+* Combine group_cookie and color into a single 64 bit value, for
+* display purposes only.
+*/
+   return (group_cookie << 32) | (color & 0xffffffff);
+}
+#endif
+
 struct write_core_tag {
struct cgroup_subsys_state *css;
unsigned long cookie;
@@ -10550,6 +10565,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -10737,6 +10760,14 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 25/32] sched: Refactor core cookie into struct

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

The overall core cookie is currently a single unsigned long value. This
poses issues as we seek to add additional sub-fields to the cookie. This
patch refactors the core_cookie to be a pointer to a struct containing
an arbitrary set of cookie fields.

We maintain a sorted RB tree of existing core cookies so that multiple
tasks may share the same core_cookie.

This will be especially useful in the next patch, where the concept of
cookie color is introduced.
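
An illustrative sketch of the lookup side of that rb-tree (not the exact
patch body, which also takes sched_core_cookies_lock and manages refcounts):

	static struct sched_core_cookie *
	sched_core_cookie_find(const struct sched_core_cookie *key)
	{
		struct rb_node *node = sched_core_cookies.rb_node;

		while (node) {
			struct sched_core_cookie *c;
			int cmp;

			c = container_of(node, struct sched_core_cookie, node);
			cmp = sched_core_cookie_cmp(key, c);
			if (cmp < 0)
				node = node->rb_left;
			else if (cmp > 0)
				node = node->rb_right;
			else
				return c;	/* existing cookie, can be shared */
		}

		return NULL;	/* caller allocates and rb_insert_color()s a new one */
	}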

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 481 +--
 kernel/sched/sched.h |  11 +-
 2 files changed, 429 insertions(+), 63 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc36c384364e..bd75b3d62a97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3958,6 +3958,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
unsigned long flags;
+   int __maybe_unused ret;
 
__sched_fork(clone_flags, p);
/*
@@ -4037,20 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #ifdef CONFIG_SCHED_CORE
	RB_CLEAR_NODE(&p->core_node);
 
-   /*
-* If parent is tagged via per-task cookie, tag the child (either with
-* the parent's cookie, or a new one). The final cookie is calculated
-* by concatenating the per-task cookie with that of the CGroup's.
-*/
-   if (current->core_task_cookie) {
-
-   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
-   if (!(clone_flags & CLONE_THREAD)) {
-   return sched_core_share_tasks(p, p);
-   }
-   /* Otherwise share the parent's per-task tag. */
-   return sched_core_share_tasks(p, current);
-   }
+   ret = sched_core_fork(p, clone_flags);
+   if (ret)
+   return ret;
 #endif
return 0;
 }
@@ -9059,6 +9049,9 @@ void sched_offline_group(struct task_group *tg)
	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+  unsigned long *group_cookie_ptr);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9073,11 +9066,7 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-   tsk->core_cookie = 0UL;
-
-   if (tg->tagged /* && !tsk->core_cookie ? */)
-   tsk->core_cookie = (unsigned long)tg;
+   sched_core_change_group(tsk, tg);
 #endif
 
tsk->sched_task_group = tg;
@@ -9177,9 +9166,9 @@ static void cpu_cgroup_css_offline(struct 
cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
struct task_group *tg = css_tg(css);
 
-   if (tg->tagged) {
+   if (tg->core_tagged) {
sched_core_put();
-   tg->tagged = 0;
+   tg->core_tagged = 0;
}
 #endif
 }
@@ -9751,38 +9740,225 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 
 #ifdef CONFIG_SCHED_CORE
 /*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
  */
 struct sched_core_cookie {
+   unsigned long task_cookie;
+   unsigned long group_cookie;
+
+   struct rb_node node;
refcount_t refcnt;
 };
 
 /*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+   refcount_t refcnt;
+};
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {  \
+   if (a->field < b->field)\
+   return -1;  \
+   else if (a->field > b->field)   \
+   return 1;   \
+} while (0)  

[PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-11-17 Thread Joel Fernandes (Google)
Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
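
A userspace sketch of the interface added here (assuming the prctl takes the
target pid as its second argument, with 0 meaning "reset my cookie"):

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/prctl.h>
	#include <sys/types.h>

	#ifndef PR_SCHED_CORE_SHARE
	#define PR_SCHED_CORE_SHARE 59
	#endif

	int main(int argc, char **argv)
	{
		pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;

		/* Share a core-scheduling cookie between the caller and 'pid'. */
		if (prctl(PR_SCHED_CORE_SHARE, pid, 0, 0, 0) == -1) {
			perror("PR_SCHED_CORE_SHARE");
			return 1;
		}
		return 0;
	}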

Tested-by: Julien Desfossez 
Reviewed-by: Aubrey Li 
Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  1 +
 include/uapi/linux/prctl.h   |  3 ++
 kernel/sched/core.c  | 51 +---
 kernel/sys.c |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6a3b0fa952b..79d76c78cc8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,7 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ccca355623a..a95898c75bdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -4037,8 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
	RB_CLEAR_NODE(&p->core_node);
 
/*
-* Tag child via per-task cookie only if parent is tagged via per-task
-* cookie. This is independent of, but can be additive to the CGroup tagging.
+* If parent is tagged via per-task cookie, tag the child (either with
+* the parent's cookie, or a new one). The final cookie is calculated
+* by concatenating the per-task cookie with that of the CGroup's.
 */
if (current->core_task_cookie) {
 
@@ -9855,7 +9857,7 @@ static int sched_core_share_tasks(struct task_struct *t1, 
struct task_struct *t2
unsigned long cookie;
int ret = -ENOMEM;
 
-   mutex_lock(&sched_core_mutex);
+   mutex_lock(&sched_core_tasks_mutex);
 
/*
 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9954,10 +9956,51 @@ static int sched_core_share_tasks(struct task_struct 
*t1, struct task_struct *t2
 
ret = 0;
 out_unlock:
-   mutex_unlock(&sched_core_mutex);
+   mutex_unlock(&sched_core_tasks_mutex);
return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+   struct task_struct *task;
+   int err;
+
+   if (pid == 0) { /* Reset current task's cookie. */
+   /* Resetting a cookie requires privileges. */
+   if (current->core_task_cookie)
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   task = NULL;
+   } else {
+   rcu_read_lock();
+   task = pid ? find_task_by_vpid(pid) : current;
+   if (!task) {
+   rcu_read_unlock();
+   return -ESRCH;
+   }
+
+   get_task_struct(task);
+
+   /*
+* Check if this process has the right to modify the specified
+* process. Use the regular "ptrace_may_access()" checks.
+*/
+   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+   rcu_read_unlock();
+   err = -EPERM;
+   goto out_put;
+   }
+   rcu_read_unlock();
+   }
+
+   err = sched_core_share_tasks(current, task);
+out_put:
+   if (task)
+   put_task_struct(task);
+   return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..61a3c98e36de 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, 
unsigned long, arg3,
 
error = (current->flags & PR_IO

[PATCH -tip 24/32] sched: Release references to the per-task cookie on exit

2020-11-17 Thread Joel Fernandes (Google)
During exit, we have to free the references to a cookie that might be shared by
many tasks. This commit therefore ensures that when the task_struct is released,
any references to cookies it holds are also released.

Reviewed-by: Chris Hyser 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h | 3 +++
 kernel/fork.c | 1 +
 kernel/sched/core.c   | 8 
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 79d76c78cc8e..6fbdb1a204bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,11 +2084,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a95898c75bdf..cc36c384364e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10066,6 +10066,14 @@ static int cpu_core_tag_write_u64(struct 
cgroup_subsys_state *css, struct cftype
 
return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+   if (!tsk->core_task_cookie)
+   return;
+   sched_core_put_task_cookie(tsk->core_task_cookie);
+   sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.2.299.gdc1121823c-goog


