Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-05-13 Thread Michal Koutný
On Tue, Apr 09, 2024 at 11:01:26AM GMT, Steven Rostedt  
wrote:
> > -   tpid = pid & (PID_MAX_DEFAULT - 1);
> > +   tpid = pid % PID_MAP_SIZE;
> 
> Does that compile to the same? This is a fast path.

I didn't check.
If fast is the intention, I would change it to something
like BUILD_BUG_ON(PID_MAP_SIZE & (PID_MAP_SIZE - 1)) (i.e. assert the size
is a power of two) and keep the bit operation without reliance on compiler
optimizations.
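
A minimal sketch of what I mean (assuming PID_MAP_SIZE stays a power of
two; PID_MAP_SIZE and tpid as in patch 1/3):

	BUILD_BUG_ON(PID_MAP_SIZE & (PID_MAP_SIZE - 1));	/* power of two */
	tpid = pid & (PID_MAP_SIZE - 1);			/* same as the old mask */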

Thanks for the response (I may not follow up on this single commit
though).

Michal




Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-05-13 Thread Michal Koutný
On Mon, Apr 08, 2024 at 04:58:18PM GMT, Michal Koutný  wrote:
> The kernel provides mechanisms, while it should not imply policies --
> default pid_max seems to be an example of the policy that does not fit
> all. At the same time pid_max must have some value assigned, so use the
> end of the allowed range -- pid_max_max.
> 
> This change thus increases initial pid_max from 32k to 4M (x86_64
> defconfig).

Out of curiosity I dug out the commit
acdc721fe26d ("[PATCH] pid-max-2.5.33-A0") v2.5.34~5
that introduced the 32k default. The commit message doesn't say why such
a sudden change though.
Previously, the limit was 1G of pids (i.e. effectively no default limit
like the intention of this series).

Honestly, I expected more enthusiasm for, or reasons against, removing the
default value of pid_max. Is this really not of interest to anyone?

(Thanks, Andrew, for your responses. I don't plan to pursue this further
should there be no more interest in having fewer default limit values in
the kernel.)

Regards,
Michal




Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-12 Thread Michal Koutný
On Thu, Apr 11, 2024 at 03:03:31PM -0700, Andrew Morton 
 wrote:
> A large increase in the maximum number of processes.

The change from (some) default to effective infinity is the crux of the
change, because the default is only a number.
(Thus I don't find the number's 12700% increase alone a big change.)

The actual maximum number of processes is "workload dependent" and hence
should be determined based on the particular workload.

> Or did I misinterpret?

I thought you saw an issue with the projection of that number into sizings
based on the default. Which of them constitutes the large change in your
eyes?

Thanks,
Michal



Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-11 Thread Michal Koutný
Hello.

On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton 
 wrote:
> That seems like a large change.

In what sense is it large?

I tried to look up the code parts that depend on this default and either
add the other patches or mention the impact (that part could be more
thorough) in the commit message.

> It isn't clear why we'd want to merge this patchset.  Does it improve
> anyone's life and if so, how?

- kernel devs who don't care about policy
  - policy should be decided by distros/users, not in kernel

- users who need many threads
  - current default is too low
  - this is one more place to look at when configuring

- users who want to prevent fork-bombs
  - current default is ineffective (too high), false feeling of safety
  - i.e. they should configure appropriate mechanism appropriately


I thought that the first point alone would be convincing and that only
scaling impact might need clarification.

Regards,
Michal



Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Michal Koutný
On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang 
 wrote:
> It's always non-NULL based on testing.
> 
> It's hard for me to say definitely by reading the code. But IIUC
> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and config limits, etc., for the diasabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by KConfig, so
> get_current_misc_cg() thus  sgx_get_current_cg() should not return NULL
> regardless 'cgroup_disable=misc'.
> 
> Maybe @Michal or @tj can confirm?

The current implementation creates the root css object regardless (see
cgroup_init(); the cgroup_ssid_enabled() check comes after
cgroup_init_subsys()).
I.e. it will look like all tasks are permanently members of the root cgroup
wrt the given controller, and the controller's attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking the example from memcg -- NULL is best
treated as synonymous with root. It's a judgement call which of the values
to store and when to interpret it.

HTH,
Michal




[PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-08 Thread Michal Koutný
pid_max is a per-pidns (thus also global) limit on the number of tasks the
kernel admits. The knob can be configured by the admin in the range between
pid_max_min and pid_max_max (sic). The default value sits between
those and it typically equals max(32k, 1k*nr_cpus).

The nr_cpu scaling was introduced in commit 72680a191b93 ("pids:
increase pid_max based on num_possible_cpus") to accommodate the kernel's
own helper tasks (before workqueues). Generally, a 1024 tasks/cpu cap is
too much if they were all running and at the same time too little when they
are idle (memory being the bottleneck).

The kernel also provides other mechanisms to restrict the number of tasks
-- the threads-max sysctl and RLIMIT_NPROC with memory-scaled defaults, and
the generic pids cgroup controller (the last one being the intended remedy
for fork-bombs, with qualified limits set up by the admin).

The kernel provides mechanisms, while it should not imply policies --
default pid_max seems to be an example of the policy that does not fit
all. At the same time pid_max must have some value assigned, so use the
end of the allowed range -- pid_max_max.

This change thus increases initial pid_max from 32k to 4M (x86_64
defconfig).

This has an effect on the size of the structure that
alloc_pid()/idr_alloc_cyclic() eventually uses and on the structure that
kernel tracing uses with 'record-tgid' (~16 MiB).

Signed-off-by: Michal Koutný 
---
 include/linux/pid.h |  4 ++--
 include/linux/threads.h | 15 ---
 kernel/pid.c|  8 +++-
 3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index a3aad9b4074c..0d191ac02958 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -106,8 +106,8 @@ extern void exchange_tids(struct task_struct *task, struct 
task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 enum pid_type);
 
-extern int pid_max;
-extern int pid_max_min, pid_max_max;
+extern int pid_max_min, pid_max;
+extern const int pid_max_max;
 
 /*
  * look up a PID in the hash table. Must be called with the tasklist_lock
diff --git a/include/linux/threads.h b/include/linux/threads.h
index c34173e6c5f1..43f8f38a0c13 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -22,25 +22,18 @@
 
 #define MIN_THREADS_LEFT_FOR_ROOT 4
 
-/*
- * This controls the default maximum pid allocated to a process
- */
-#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
-
 /*
  * A maximum of 4 million PIDs should be enough for a while.
  * [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
  */
 #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
-   (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
+   (sizeof(long) > 4 ? 4 * 1024 * 1024 : 0x8000))
 
 /*
- * Define a minimum number of pids per cpu.  Heuristically based
- * on original pid max of 32k for 32 cpus.  Also, increase the
- * minimum settable value for pid_max on the running system based
- * on similar defaults.  See kernel/pid.c:pid_idr_init() for details.
+ * Define a minimum number of pids per cpu. Mainly to accommodate
+ * smpboot_register_percpu_thread() kernel threads.
+ * See kernel/pid.c:pid_idr_init() for details.
  */
-#define PIDS_PER_CPU_DEFAULT   1024
 #define PIDS_PER_CPU_MIN   8
 
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index da76ed1873f7..24ae505ac3b0 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -60,10 +60,10 @@ struct pid init_struct_pid = {
}, }
 };
 
-int pid_max = PID_MAX_DEFAULT;
+int pid_max = PID_MAX_LIMIT;
 
 int pid_max_min = RESERVED_PIDS + 1;
-int pid_max_max = PID_MAX_LIMIT;
+const int pid_max_max = PID_MAX_LIMIT;
 /*
  * Pseudo filesystems start inode numbering after one. We use Reserved
  * PIDs as a natural offset.
@@ -652,9 +652,7 @@ void __init pid_idr_init(void)
/* Verify no one has done anything silly: */
BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
 
-   /* bump default and minimum pid_max based on number of cpus */
-   pid_max = min(pid_max_max, max_t(int, pid_max,
-   PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
+   /* bump minimum pid_max based on number of cpus */
pid_max_min = max_t(int, pid_max_min,
PIDS_PER_CPU_MIN * num_possible_cpus());
pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
-- 
2.44.0




[PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-04-08 Thread Michal Koutný
Calculations indexing into map_pid_to_cmdline use PID_MAX_DEFAULT but they
actually depend on the size of map_pid_to_cmdline. The size of the map
may be arbitrary. First, refer to the map size where necessary; second,
pick a good value for the size of the map.
Since the buffer is allocated at boot (i.e. the user cannot affect its size
later), accounting for the full PID_MAX_LIMIT would inflate the map's size
unnecessarily (4*4M) for all users. Stick to the original value of
4*32k; commit 785e3c0a3a87 ("tracing: Map all PIDs to command
lines") explains why it still works for higher pids.

The point of this exercise is to remove dependency on PID_MAX_DEFAULT.

Signed-off-by: Michal Koutný 
---
 kernel/trace/trace_sched_switch.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_sched_switch.c 
b/kernel/trace/trace_sched_switch.c
index 8a407adb0e1c..aca2dafdd97a 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -161,6 +161,7 @@ static size_t tgid_map_max;
 
 #define SAVED_CMDLINES_DEFAULT 128
 #define NO_CMDLINE_MAP UINT_MAX
+#define PID_MAP_SIZE (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
 /*
  * Preemption must be disabled before acquiring trace_cmdline_lock.
  * The various trace_arrays' max_lock must be acquired in a context
@@ -168,7 +169,7 @@ static size_t tgid_map_max;
  */
 static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 struct saved_cmdlines_buffer {
-   unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
+   unsigned map_pid_to_cmdline[PID_MAP_SIZE];
unsigned *map_cmdline_to_pid;
unsigned cmdline_num;
int cmdline_idx;
@@ -248,7 +249,7 @@ int trace_save_cmdline(struct task_struct *tsk)
if (!tsk->pid)
return 1;
 
-   tpid = tsk->pid & (PID_MAX_DEFAULT - 1);
+   tpid = tsk->pid % PID_MAP_SIZE;
 
/*
 * It's not the end of the world if we don't get
@@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[])
return;
}
 
-   tpid = pid & (PID_MAX_DEFAULT - 1);
+   tpid = pid % PID_MAP_SIZE;
map = savedcmd->map_pid_to_cmdline[tpid];
if (map != NO_CMDLINE_MAP) {
tpid = savedcmd->map_cmdline_to_pid[map];
@@ -645,8 +646,8 @@ tracing_saved_cmdlines_size_write(struct file *filp, const 
char __user *ubuf,
if (ret)
return ret;
 
-   /* must have at least 1 entry or less than PID_MAX_DEFAULT */
-   if (!val || val > PID_MAX_DEFAULT)
+   /* must have at least 1 entry or fit into map_pid_to_cmdline */
+   if (!val || val >= PID_MAP_SIZE)
return -EINVAL;
 
ret = tracing_resize_saved_cmdlines((unsigned int)val);
-- 
2.44.0




[PATCH 3/3] tracing: Compare pid_max against pid_list capacity

2024-04-08 Thread Michal Koutný
trace_pid_list_alloc() checks pid_max against a magic number referencing
an (obsolete) source file when it actually should check against the
capacity of the pid_list tree. Turn the definition of MAX_PID around --
derive it from the tree parameters and replace references to the magic
value and header files with the so-defined MAX_PID.
Should PID_MAX_LIMIT change in the future or pid_max escape PID_MAX_LIMIT,
appropriate checks remain in place. No functional change intended.

Signed-off-by: Michal Koutný 
---
 kernel/trace/pid_list.c | 6 +++---
 kernel/trace/pid_list.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
index 95106d02b32d..b968f0b65dc1 100644
--- a/kernel/trace/pid_list.c
+++ b/kernel/trace/pid_list.c
@@ -93,7 +93,7 @@ static inline bool upper_empty(union upper_chunk *chunk)
 static inline int pid_split(unsigned int pid, unsigned int *upper1,
 unsigned int *upper2, unsigned int *lower)
 {
-   /* MAX_PID should cover all pids */
+   /* MAX_PID must cover all possible pids */
BUILD_BUG_ON(MAX_PID < PID_MAX_LIMIT);
 
/* In case a bad pid is passed in, then fail */
@@ -413,8 +413,8 @@ struct trace_pid_list *trace_pid_list_alloc(void)
struct trace_pid_list *pid_list;
int i;
 
-   /* According to linux/thread.h, pids can be no bigger that 30 bits */
-   WARN_ON_ONCE(pid_max > (1 << 30));
+   /* See pid_split(), equal to pid_max > PID_MAX_LIMIT */
+   WARN_ON_ONCE(pid_max > MAX_PID);
 
pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
if (!pid_list)
diff --git a/kernel/trace/pid_list.h b/kernel/trace/pid_list.h
index 62e73f1ac85f..28562a9a3d01 100644
--- a/kernel/trace/pid_list.h
+++ b/kernel/trace/pid_list.h
@@ -56,8 +56,8 @@
 
 #define UPPER_MASK (UPPER_MAX - 1)
 
-/* According to linux/thread.h pids can not be bigger than or equal to 1 << 30 
*/
-#define MAX_PID(1 << 30)
+/* Structure can hold only pids strictly below this limit */
+#define MAX_PID(1 << (UPPER_BITS + UPPER_BITS + LOWER_BITS))
 
 /* Just keep 6 chunks of both upper and lower in the cache on alloc */
 #define CHUNK_ALLOC 6
-- 
2.44.0




[PATCH 0/3] kernel/pid: Remove default pid_max value

2024-04-08 Thread Michal Koutný
TL;DR excerpt from commit 02/03:

The kernel provides mechanisms, while it should not imply policies --
default pid_max seems to be an example of the policy that does not fit
all. At the same time pid_max must have some value assigned, so use the
end of the allowed range -- pid_max_max.

More details are in that commit's message. The other two commits are
related preparation and a less related refresh of code that references
pid_max.

Michal Koutný (3):
  tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
  kernel/pid: Remove default pid_max value
  tracing: Compare pid_max against pid_list capacity

 include/linux/pid.h   |  4 ++--
 include/linux/threads.h   | 15 ---
 kernel/pid.c  |  8 +++-
 kernel/trace/pid_list.c   |  6 +++---
 kernel/trace/pid_list.h   |  4 ++--
 kernel/trace/trace_sched_switch.c | 11 ++-
 6 files changed, 20 insertions(+), 28 deletions(-)


base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
-- 
2.44.0




Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Michal Koutný
On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang 
 wrote:
> Do we really want to have it implemented in c?

I only pointed to the available C boilerplate.

> There are much fewer lines of
> code in shell scripts. Note we are not really testing basic cgroup stuff.
> All we needed were creating/deleting cgroups and set limits which I think
> have been demonstrated feasible in the ash scripts now.

I assume you refer to
Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
right?

Could it be even simpler if you didn't stick to cgtools APIs and v1
compatibility?

Reducing ash_cgexec.sh to something like

	echo 0 >$R/$1/cgroup.procs
	shift
	exec "$@"

(with some small boilerplate for $R and the previous mkdirs)


Thanks,
Michal




Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Michal Koutný
Hello.

On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen  
wrote:
> > > It'd be more complicated and less readable to do all the stuff without 
> > > the  
> > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends 
> > >  
> > > on libc so I hope this would not cause too much inconvenience.
> >
> > As per cgroup-tools, please prove this. It makes the job for more
> > complicated *for you* and you are making the job more  complicated
> > to every possible person in the planet running any kernel QA.
> >
> > I weight the latter more than the former. And it is exactly the
> > reason why we did custom user space kselftest in the first place.
> > Let's keep the tradition. All I can say is that kselftest is 
> > unfinished in its current form.
> >
> > What is "esp cgexec"?
> 
> Also in kselftest we don't drive ultimate simplicity, we drive
> efficient CI/QA. By open coding something like subset of
> cgroup-tools needed to run the test you also help us later
> on to backtrack the kernel changes. With cgroups-tools you
> would have to use strace to get the same info.

FWIW, see also functions in
tools/testing/selftests/cgroup/cgroup_util.{h,c}.
They likely cover what you need already -- if the tests are in C.

(I admit that stuff in tools/testing/selftests/cgroup/ is best
understood with strace.)

HTH,
Michal




Re: Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-27 Thread Michal Koutný
Hello.

On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang 
 wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC capacity,
> if one group has a fault, and its usage is not above its own limit
> (try_charge() passes), yet total usage of the system has exceeded the
> capacity, whether we do global reclaim or just reclaim pages in the current
> faulting group.
> 
> > Also, what does the core mm memcg code do?
> > 
> I'm not sure. I'll try to find out but it'd be appreciated if someone more
> knowledgeable can comment on this. memcg also has the protection mechanism
> (i.e., min, low settings) to guarantee some allocation per group so its
> approach might not be applicable to misc controller here.

I only follow the discussion superficially but it'd be nice to have
analogous mechanisms in memcg and sgx controller.

The memory limits are rather simple -- when allocating new memory, the
tightest limit among the ancestors applies and reclaim is applied to the
whole subtree of such an ancestor (not necessarily the originating cgroup
of the allocation). Overcommit is admitted; competition among siblings is
resolved on a first come, first served basis.
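
Roughly (a self-contained sketch with generic types, not the actual memcg
code, just to illustrate the "tightest ancestor limit" part):

	/* Sketch only: the first ancestor whose limit the new charge would
	 * break is the one whose whole subtree gets reclaimed. */
	struct cg {
		struct cg *parent;
		unsigned long usage, max;
	};

	static struct cg *tightest_over_limit(struct cg *cg, unsigned long amount)
	{
		struct cg *iter;

		for (iter = cg; iter; iter = iter->parent)
			if (iter->usage + amount > iter->max)
				return iter;	/* reclaim applies to this subtree */
		return NULL;		/* no limit hit, the charge just proceeds */
	}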

The memory protections are an additional (and in a sense orthogonal)
mechanism. They can be interpreted as limits that are enforced not at
the time of allocation but at the time of reclaim (and reclaim is
triggered independently, either globally or via the limits above). The
consequence is that the protection code must do some normalization to
evaluate overcommitted values sensibly.

(Historically, there was also memory.soft_limit_in_bytes, which combined
"protection" and global reclaim but it turned out broken. I suggest
reading Documentation/admin-guide/cgroup-v2.rst section Controller
Issues and Remedies/Memory for more details and as a cautionary example.)

HTH,
Michal




Re: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-02-26 Thread Michal Koutný
On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang 
 wrote:
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_res_ops sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> +}

This is possibly a nitpick but I share it here for consideration.

Would it be more prudent to have the signature like
  alloc(struct misc_res *res, struct misc_cg *cg)
so that implementations are free of the assumption of how cg and res are
stored?
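
I.e. a sketch of the suggested signature (not the actual code in the
series; the .priv field is as introduced by this patch):

	struct misc_res_ops {
		int (*alloc)(struct misc_res *res, struct misc_cg *cg);
		void (*free)(struct misc_res *res, struct misc_cg *cg);
	};

	static int sgx_epc_cgroup_alloc(struct misc_res *res, struct misc_cg *cg)
	{
		struct sgx_epc_cgroup *epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);

		if (!epc_cg)
			return -ENOMEM;
		res->priv = epc_cg;	/* no knowledge of cg->res[] layout needed */
		epc_cg->cg = cg;
		return 0;
	}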


Thanks,
Michal




Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Michal Koutný
On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
wrote:
> I am not sure, but is it possible or legal for an ancestor to have less limit
> than children?

Why not?
It is desired for proper hierarchical delegation and the tightest limit
of the ancestors applies (cf memory.max).

Michal




Re: Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-01-29 Thread Michal Koutný
On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh 
 wrote:
> There is already a patch posted for similar problem -
> https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/

I was not aware, thanks for the pointer.

Do you plan to update your patch to "just" bail-out/zero instead of
using slightly random values (as pointed out by Bjorn)?

Michal




[RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-01-12 Thread Michal Koutný
Add bounds checks on values read from shared memory in the tx path. In
cases where the VM is misbehaving, the transport should bail out and print
a warning when bogus values could cause an out-of-bounds read.

Link: 
https://git.codelinaro.org/clo/la/kernel/msm-5.10/-/commit/32d9c3a2f2b6a4d1fc48d6871194f3faf3184e8b
Suggested-by: Chris Lew 
Suggested-by: Sarannya S 
Signed-off-by: Michal Koutný 
---
 drivers/rpmsg/qcom_glink_smem.c | 9 +
 1 file changed, 9 insertions(+)

Why RFC? The patch is adopted from the link above. It would be good to
assess whether such conditions can also happen with rpmsg glink.
(And if so, whether the zeroed values are the best correction.)

diff --git a/drivers/rpmsg/qcom_glink_smem.c b/drivers/rpmsg/qcom_glink_smem.c
index 7a982c60a8dd..3e786e590c03 100644
--- a/drivers/rpmsg/qcom_glink_smem.c
+++ b/drivers/rpmsg/qcom_glink_smem.c
@@ -146,6 +146,11 @@ static size_t glink_smem_tx_avail(struct qcom_glink_pipe 
*np)
else
avail -= FIFO_FULL_RESERVE + TX_BLOCKED_CMD_RESERVE;
 
+   if (avail > pipe->native.length) {
+   pr_warn_once("%s: avail clamped\n", __func__);
+   avail = 0;
+   }
+
return avail;
 }
 
@@ -177,6 +182,10 @@ static void glink_smem_tx_write(struct qcom_glink_pipe 
*glink_pipe,
unsigned int head;
 
head = le32_to_cpu(*pipe->head);
+   if (head > pipe->native.length) {
+   pr_warn_once("%s: head overflow\n", __func__);
+   return;
+   }
 
head = glink_smem_tx_write_one(pipe, head, hdr, hlen);
head = glink_smem_tx_write_one(pipe, head, data, dlen);
-- 
2.43.0




Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events

2024-01-05 Thread Michal Koutný
On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang 
 wrote:
> From: Kristen Carlson Accardi 
> 
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
> 
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.

I notice now that the callbacks are per resource and per cgroup.
I think that is overkill which complicates the misc_cg allocation code:
allocation goes through the parent's callbacks while freeing is done by the
cgroup's own callbacks. It's possibly too liberal to keep the objects'
lifecycle understandable in the long term too.

For some uniformity, I'd suggest the struct blkcg_policy array in
block/blk-cgroup.c as the precedent. (Perhaps indexed with the static enum
misc_res_type instead of the dynamic index assignment that
blkcg_policy_register() does.)
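
I.e. something like (a sketch only; misc_cg_set_ops() is a made-up name
here, just to illustrate the registration):

	/* one static table instead of per-cgroup callback pointers */
	static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];

	int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
	{
		if (type >= MISC_CG_RES_TYPES)
			return -EINVAL;
		misc_res_ops[type] = ops;	/* consulted from css_alloc/css_free */
		return 0;
	}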

I hope you don't really need per-cgroup callbacks :-)

Michal




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Michal Koutný
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang 
 wrote:
> Thanks for raising this. Actually my understanding the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
> 
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
> 
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.

That is the use of limits to prevent a runaway cgroup from smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways, the less likely they are.
The peak consumption comes at the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).

> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-commiting
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.

Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.

My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.

HTH,
Michal




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Michal Koutný
On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen  
wrote:
> 1. Admin sets a limit
> 2. Enclave is created
> 3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series into a part
implementing this simple limiting and a subsequent part with the reclaim
stuff, for easier digestibility.

> Nothing else matters.

If the latter part is unnecessary overkill, it's even better.

Thanks,
Michal




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Michal Koutný
On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný  
wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types also warrant two types of misc resource? (To
deal with each in its own way.)

Thanks,
Michal




Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-10-17 Thread Michal Koutný
On Fri, Sep 22, 2023 at 08:06:40PM -0700, Haitao Huang 
 wrote:
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct 
> kernfs_open_file *of, char *buf,
>  
>   cg = css_misc(of_css(of));
>  
> - if (READ_ONCE(misc_res_capacity[type]))
> + if (READ_ONCE(misc_res_capacity[type])) {
>   WRITE_ONCE(cg->res[type].max, max);
> - else
> + if (cg->res[type].max_write)
> + cg->res[type].max_write(cg);
> + } else {
>   ret = -EINVAL;
>
> + }

Is it time for a misc_cg_mutex? This gives no synchronization guarantees
to implementors of max_write. (Alternatively, document that the
callback must implement its own synchronization.)
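
E.g. (a sketch; misc_cg_mutex would be new, it doesn't exist today):

	static DEFINE_MUTEX(misc_cg_mutex);	/* hypothetical */
	...
	mutex_lock(&misc_cg_mutex);
	WRITE_ONCE(cg->res[type].max, max);
	if (cg->res[type].max_write)
		cg->res[type].max_write(cg);
	mutex_unlock(&misc_cg_mutex);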


Thanks,
Michal




Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-17 Thread Michal Koutný
On Fri, Sep 22, 2023 at 08:06:55PM -0700, Haitao Huang 
 wrote:
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);

It should check for !epc_cg since the misc controller implementation
in misc_cg_alloc() would roll back even on non-allocated resources.
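
I.e. roughly:

	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
	if (!epc_cg)	/* misc_cg_alloc() rollback may pass a res that was never allocated */
		return;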

> + cancel_work_sync(&epc_cg->reclaim_work);
> + kfree(epc_cg);
> +}
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> + struct sgx_epc_reclaim_control rc;
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> + sgx_epc_reclaim_control_init(&rc, epc_cg);
> + /* Let the reclaimer to do the work so user is not blocked */
> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);

This is weird. The writer will never learn about the result of the
operation.

Thanks,
Michal




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Michal Koutný
Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang 
 wrote:
> AFAIK, before we introducing max_write() callback in this series, no misc
> controller would possibly enforce the limit when misc.max is reduced. e.g. I
> don't think CVMs be killed when ASID limit is reduced and the cgroup was
> full before limit is reduced.

Yes, the misc controller was meant to be simple; current >= max serves to
prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages but it turned out to be too coarse (OOM killing
processes if the amount was not sensed correctly) and this eventually
evolved into the specific mechanism of memory.reclaim.
I'm mentioning this in case such an interface would have better semantics
for your use case (and misc.max writes could remain non-preemptive).

One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then I found no such measure in the
code implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better suiting name
  than OOM (out of memory) condition,
- killing -- (unless you have an intention to implement process
  termination later) my current interpretation is that it is rather some
  aggressive unmapping within the address space, so a less confusing name
  for that would be "reclaim".


> I think EPC pages to VMs could have the same behavior, once they are given
> to a guest, never taken back by the host. For enclaves on host side, pages
> are reclaimable, that allows us to enforce in a similar way to memcg.

Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do users
have an option to lock certain pages in memory that yields this
difference?

Regards,
Michal




Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-08 Thread Michal Koutný
On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra  
wrote:
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'
> cgroup properties.
The tasks file is a consequence of binding this to cgroups; I'm one step
back. Why make "core isolation" a cgroup property?

(I understand this could help "visualize" what the common domains are if
cgroups were the only API, but with prctl the structure can be
arbitrarily modified anyway.)


> Given something like:
> 
> R
>/ \
>   A   B
>  / \
> C   D
Thanks for the example. 

> B group can set core_sched=1 and then all its (and its decendants) tasks
> get to have the same (group) cookie and cannot share with others.
The same could be achieved with the first task of group B allocating its
new cookie, which would be inherited by its descendants.

> If however B is a delegate and has a subgroup D that is security
> sensitive and must not share core resources with the rest of B, then it
> can also set D.core_sched=1, such that D (and its decendants) will have
> another (group) cookie.
If there is such a sensitive descendant task, it could allocate a new
cookie (same way as the first one in B did).

> On top of this, say C has a Real-Time tasks, that wants to limit SMT
> interference, then it can set a (task/prctl) cookie on itself, such that
> it will not share the core with the rest of the tasks of B.
(IIUC, in this particular example it'd be redundant if B had no inner
tasks since D isolated itself already.)
Yes, so this is again the same pattern as the tasks above have done.

> In that scenario the D subtree is a restriction (doesn't share) with the
> B subtree.
This implies D's isolation from everything else too, not just B's
members, no?

> And all of B is a restriction on all its tasks, including the Real-Time
> task that set a task cookie, in that none of them can share with tasks
> outside of B (including system tasks which are in R), irrespective of
> what they do with their task cookie.
IIUC, the equivalent restriction could be achieved with the PTRACE-like
check in the prctl API too (with respectively divided uids).

I'm curious whether the cgroup API actually simplifies things that are
possible with the clone/prctl API or allows anything that wouldn't be
otherwise possible.

Regards,
Michal





Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-07 Thread Michal Koutný
Hello.

IIUC, the premise is that tasks with different cookies never share a
core.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It 
> is
> however changed such that you can have nested tags. The for any given task, 
> the
> first parent with a cookie is the effective one. The rationale is that this 
> way
> you can delegate subtrees and still allow them some control over grouping.

Given the existence of prctl and clone APIs, I don't see the reason to
have a separate cgroup-bound interface too (as argued by Tejun). The
potential speciality is the possibility to re-tag whole groups of
processes at runtime (but the listed use cases [1] don't require that
and it's not such a good idea given its asynchronicity).

Also, I would find useful some more explanation of how the hierarchical
behavior is supposed to work. In my understanding, a task is either
allowed to request its own isolation or not. The cgroups could be used
to restrict this privilege; however, that doesn't seem to be the case
here.

My two cents,
Michal

[1] https://lore.kernel.org/lkml/20200822030155.ga414...@google.com/




Re: [PATCH 06/11] cgroup: fix -Wzero-length-bounds warnings

2021-03-30 Thread Michal Koutný
On Tue, Mar 30, 2021 at 11:00:36AM +0200, Arnd Bergmann  wrote:
> Would it be possible to enclose most or all of kernel/cgroup/cgroup.c
> in an #ifdef CGROUP_SUBSYS_COUNT block?
Even without any controllers, there can still be named hierarchies (v1)
or the default hierarchy (v2) (for instance) for process tracking
purposes. So only parts of kernel/cgroup/cgroup.c could be ifdef'd.

Beware that CGROUP_SUBSYS_COUNT is not known at preprocessing stage (you
could have a macro alternative though).

> I didn't try that myself, but this might be a way to guarantee that
> there cannot be any callers (it would cause a link error).
Such a guarantee would be nicer, I agree. I tried a bit but abandoned it
when I saw macros proliferate (which I found less readable than your
current variant). But YMMV.

Michal




Re: [PATCH 06/11] cgroup: fix -Wzero-length-bounds warnings

2021-03-30 Thread Michal Koutný
On Mon, Mar 22, 2021 at 05:02:44PM +0100, Arnd Bergmann  wrote:
> I'm not sure what is expected to happen for such a configuration,
> presumably these functions are never calls in that case.
Yes, the functions you patched would only be called from subsystems or
there should be no way to obtain a struct cgroup_subsys reference
anyway (hence it's ok to always branch as if ss==NULL).

I'd prefer a variant that wouldn't compile the affected codepaths when
there are no subsystems registered, however, I couldn't come up with a
way how to do it without some preprocessor ugliness.

Reviewed-by: Michal Koutný 




Re: [PATCH] mm: memcontrol: switch to rstat fix

2021-03-17 Thread Michal Koutný
On Mon, Mar 15, 2021 at 07:41:00PM -0400, Johannes Weiner  
wrote:
> Switch to the atomic variant, cgroup_rstat_irqsafe().
Congratulations(?), the first use of cgroup_rstat_irqsafe().

Reviewed-by: Michal Koutný 




Re: [Patch v3 0/2] cgroup: New misc cgroup controller

2021-03-15 Thread Michal Koutný
On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma  
wrote:
> I will add some more information in the cover letter of the next version.
Thanks.

> Each one coming up with their own interaction is a duplicate effort
> when they all need similar thing.
Could this be expressed as a new BPF hook (when allocating/freeing such
a resource unit)?

The decision could be made based on the configured limit or even some
other predicate.

(I saw this proposed already but I haven't seen much more reasoning about
whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown
controllers, though it's still a new user API.)


> As per my understanding this is the only for way for loadable modules
> (kvm-amd in this case) to access Kernel APIs. Let me know if there is a
> better way to do it.
I understood the symbols are exported for such modularized builds.
However, making them non-GPL exposes them to any out-of-tree modules,
although the resource types are supposed to stay hardcoded in the misc
controller. So my point was to make them EXPORT_SYMBOL_GPL to mark that
they're just a means of implementing the modularized builds and not an
API. (But they'd remain an API for out-of-tree GPL modules anyway, so take
this reasoning of mine with a grain of salt.)
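
Concretely, I mean just e.g. (sketch):

	EXPORT_SYMBOL_GPL(misc_cg_try_charge);	/* instead of EXPORT_SYMBOL() */
	EXPORT_SYMBOL_GPL(misc_cg_uncharge);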

Michal




Re: [Patch v3 1/2] cgroup: sev: Add misc cgroup controller

2021-03-15 Thread Michal Koutný
On Fri, Mar 12, 2021 at 11:07:14AM -0800, Vipin Sharma  
wrote:
> We should be fine without atomic64_t because we are using unsigned
> long and not 64 bit explicitly. This will work on both 32 and 64 bit
> machines.
I see.

> But I will add READ_ONCE and WRITE_ONCE because of potential chances of
> load tearing and store tearing.
> 
> Do you agree?
Yes.

> This was only here to avoid multiple reads of capacity and making sure
> if condition and seq_print will see the same value.
Aha.

> Also, I was not aware of load and store tearing of properly aligned
> and machine word size variables. I will add READ_ONCE and WRITE_ONCE
> at other places.
Yeah, although it's theoretical, I think it also serves well to annotate
such unsynchronized accesses.

Thanks,
Michal




Re: [Patch v3 1/2] cgroup: sev: Add misc cgroup controller

2021-03-11 Thread Michal Koutný
Hi Vipin.

On Thu, Mar 04, 2021 at 03:19:45PM -0800, Vipin Sharma  
wrote:
>  arch/x86/kvm/svm/sev.c|  65 +-
>  arch/x86/kvm/svm/svm.h|   1 +
>  include/linux/cgroup_subsys.h |   4 +
>  include/linux/misc_cgroup.h   | 130 +++
>  init/Kconfig  |  14 ++
>  kernel/cgroup/Makefile|   1 +
>  kernel/cgroup/misc.c  | 402 ++
Given the two-fold nature (SEV caller vs misc controller) of some
remarks below, I think it makes sense to split this into two patches:
a) generic controller implementation,
b) hooking the controller into SEV ASIDs management.

> +#ifndef CONFIG_KVM_AMD_SEV
> +/*
> + * When this config is not defined, SEV feature is not supported and APIs in
> + * this file are not used but this file still gets compiled into the KVM AMD
> + * module.
> + *
> + * We will not have MISC_CG_RES_SEV and MISC_CG_RES_SEV_ES entries in the 
> enum
> + * misc_res_type {} defined in linux/misc_cgroup.h.
BTW, was there any progress on conditioning the sev.c build on
CONFIG_KVM_AMD_SEV? (So that the defines workaround isn't needed.)

>  static int sev_asid_new(struct kvm_sev_info *sev)
>  {
> - int pos, min_asid, max_asid;
> + int pos, min_asid, max_asid, ret;
>   bool retry = true;
> + enum misc_res_type type;
> +
> + type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
> + sev->misc_cg = get_current_misc_cg();
> + ret = misc_cg_try_charge(type, sev->misc_cg, 1);
It may be safer to WARN_ON(sev->misc_cg) at this point (see below).
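
I.e. something like:

	WARN_ON(sev->misc_cg);	/* catch a leftover reference before overwriting it */
	sev->misc_cg = get_current_misc_cg();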


> [...]
> +e_uncharge:
> + misc_cg_uncharge(type, sev->misc_cg, 1);
> + put_misc_cg(sev->misc_cg);
> + return ret;
vvv

> @@ -140,6 +171,10 @@ static void sev_asid_free(int asid)
>   }
>  
>   mutex_unlock(&sev_bitmap_lock);
> +
> + type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
> + misc_cg_uncharge(type, sev->misc_cg, 1);
> + put_misc_cg(sev->misc_cg);
It may be safer to set sev->misc_cg to NULL here.

(IIUC, with the current asid_{new,free} calls it shouldn't matter but why
rely on it in the future.)
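
I.e. (sketch):

	misc_cg_uncharge(type, sev->misc_cg, 1);
	put_misc_cg(sev->misc_cg);
	sev->misc_cg = NULL;	/* don't leave a dangling pointer behind */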

> +++ b/kernel/cgroup/misc.c
> [...]
> +static void misc_cg_reduce_charge(enum misc_res_type type, struct misc_cg 
> *cg,
> +   unsigned long amount)
misc_cg_cancel_charge seems to be a name more consistent with what we
already have in the pids and memory controllers.


> +static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
> +  size_t nbytes, loff_t off)
> +{
> [...]
> +
> + if (!strcmp(MAX_STR, buf)) {
> + max = ULONG_MAX;
MAX_NUM for consistency with other places.

> + } else {
> + ret = kstrtoul(buf, 0, &max);
> + if (ret)
> + return ret;
> + }
> +
> + cg = css_misc(of_css(of));
> +
> + if (misc_res_capacity[type])
> + cg->res[type].max = max;
In theory, parallel writers can clash here, so making the limit an atomic
type would resolve it. See also commit a713af394cf3
("cgroup: pids: use atomic64_t for pids->limit").

> +static int misc_cg_current_show(struct seq_file *sf, void *v)
> +{
> + int i;
> + struct misc_cg *cg = css_misc(seq_css(sf));
> +
> + for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> + if (misc_res_capacity[i])
Since there can be some residual charges after removing capacity (before
draining), maybe a condition along the lines of

if (misc_res_capacity[i] || atomic_long_read(&cg->res[i].usage))

would be more informative for debugging.

> +static int misc_cg_capacity_show(struct seq_file *sf, void *v)
> +{
> + int i;
> + unsigned long cap;
> +
> + for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> + cap = READ_ONCE(misc_res_capacity[i]);
Why is READ_ONCE only here and not in other places that (actually) check
against the set capacity value? Also, there should be a paired
WRITE_ONCE in misc_cg_set_capacity().


Thanks,
Michal




Re: [Patch v3 0/2] cgroup: New misc cgroup controller

2021-03-11 Thread Michal Koutný
Hello.

On Sun, Mar 07, 2021 at 07:48:40AM -0500, Tejun Heo  wrote:
> Vipin, thank you very much for your persistence and patience.
Yes, and thanks for taking my remarks into account.

> Michal, as you've been reviewing the series, can you please take
> another look and ack them if you don't find anything objectionable?
Honestly, I'm still sitting on the fence whether this needs a new
controller and whether the miscontroller (:-p) is a good approach in the
long term [1].

I admit I didn't follow the past discussions completely; however,
(Vipin) could the cover letter/commit messages shortly summarize why
cgroups and a controller were chosen to implement restrictions of these
resources, what the alternatives were and why they were rejected?

In the previous discussion, I saw the reasoning for the list of the
resources to be hardwired in the controller itself in order to get some
scrutiny of possible changes. That makes sense to me. But with that, is
it necessary to commit to the new controller API via EXPORT_SYMBOL? (I
don't mean this as a licensing question but what the external API should
be (if any).)

Besides the generic remarks above, I'd still suggest some slight
implementation changes, posted inline to the patch.


Thanks,
Michal

[1] Currently, only one thing comes to my mind -- the delegation via
cgroup.subtree_control. The miscontroller may possibly add further
resources whose delegation granularity is bunched up under one entry.




Re: [PATCH v2] sched/cpuacct: Fix charge cpuacct.usage_sys incorrently.

2021-03-02 Thread Michal Koutný
Hello.

(Sorry for necroposting, found this upstream reference only now.)

On Mon, Apr 20, 2020 at 03:04:53PM +0800, Muchun Song 
 wrote:
>  /* Time spent by the tasks of the CPU accounting group executing in ... */
> @@ -339,7 +340,7 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>  {
>   struct cpuacct *ca;
>   int index = CPUACCT_STAT_SYSTEM;
> - struct pt_regs *regs = task_pt_regs(tsk);
> + struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);
I've read the discussion in [1] but I don't think this approach is
correct either (and I don't know what is better :-/).

I only have a qualitative proof:

host:~ # uname -r
5.10.16-1-default

host:~ # systemd-run -p CPUAccounting=yes sh -c 'time sh -c "i=0 ; while [ 
\"\$i\" -lt 1 ] ; do i=\$((\$i+1)) ; cat /proc/slabinfo >/dev/null ; done" 
; sleep inf'
Running as unit: run-r101b9f53efcb4d2a9bfb65feb6f120ca.service

host:~ # cat 
/sys/fs/cgroup/cpuacct/system.slice/run-r101b9f53efcb4d2a9bfb65feb6f120ca.service/cpuacct.usage{,_user,_sys}
16138535165
14332580468
1805954697

(See that sys/user ~ 0.1)

host:~ # journalctl -u run-r101b9f53efcb4d2a9bfb65feb6f120ca.service
-- Logs begin at Tue 2021-03-02 18:06:41 CET, end at Tue 2021-03-02 18:27:45 
CET. --
Mar 02 18:27:29 host systemd[1]: Started /usr/bin/sh -c time sh -c "i=0 ; while 
[ \"\$i\" -lt 1 ] ; do i=\$((\$i+1)) ; cat /proc/slabinfo >/dev/null ; 
done" ; sleep inf.
Mar 02 18:27:45 host sh[19117]: real0m15.543s
Mar 02 18:27:45 host sh[19117]: user0m10.752s
Mar 02 18:27:45 host sh[19117]: sys0m5.379s

(See that sys/user ~ 0.5)

host:~ # cat 
/sys/fs/cgroup/cpuacct/system.slice/run-r101b9f53efcb4d2a9bfb65feb6f120ca.service/cpuacct.stat
user 415
system 1209

(See that sys/user ~ 3.0 :-o)

The expectation is that a significant amount of the loop is spent in the
kernel (dumping slabinfo). I can't tell which of the ratios fits the
reality best but the cpuacct.usage_sys still seems too low.


Michal

[1] 
https://lore.kernel.org/lkml/20200416141833.50663-1-songmuc...@bytedance.com/




Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller

2021-02-26 Thread Michal Koutný
On Thu, Feb 25, 2021 at 11:28:46AM -0800, Vipin Sharma  
wrote:
> My approach here is that it is the responsibility of the caller to:
> 1. Check the return value and proceed accordingly.
> 2. Ideally, let all of the usage be 0 before deactivating this resource
>by setting capacity to 0
If the calling side can ensure itself that no new units of the resource
are used from that moment on, then it can work this way -- but describe
that in misc_cg_set_capacity() comment.

> Is the above change good?
I think both alternatives would work. But the latter (as I see it now)
would mandate dependency on CONFIG_CGROUP or it'd have to double the
similar logic itself. So maybe keeping the caller responsible explicitly
is simpler from this POV.

> Will there be any objection to extra information?
IMO it's unnecessary (stating this just for consistency reasons), no
strong opinion.

Michal




[RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance

2021-02-25 Thread Michal Koutný
leaf_cfs_rq_list should contain cfs_rqs that have runnable entities in
them.  When we're operating under a throttled hierarchy we always update
the leaf_cfs_rq_list in order not to break the list_add_leaf_cfs_rq()
invariant of adding complete branches.

This caused troubles when an entity became runnable (enqueue_entity)
under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
enqueue_task_fair() warning some more")). While that worked well, this
new change skips updating leaf_cfs_rq_list when we're operating under
a throttled hierarchy and defers the leaf_cfs_rq_list update to the
point when the whole hierarchy is unthrottled (tg_unthrottle_up()).

The code is now simpler and leaf_cfs_rq_list contains only cfs_rqs that
are truly runnable.

Why is this RFC?
- Primarily, I'm not sure I interpreted the purpose of leaf_cfs_rq_list
  right. The removal of throttled cfs_rqs from it would exclude them
  from the __update_blocked_fair() calculation and I can't see past that
  yet. If it's a wrong assumption, I'd like this to help clarify what the
  proper definition of leaf_cfs_rq_list would be.
- Additionally, I didn't check thoroughly for corner cases when
  se->on_rq => cfs_rq_of(se)->on_list wouldn't hold, so the patch
  certainly isn't finished.

Signed-off-by: Michal Koutný 
---
 kernel/sched/fair.c  | 41 ++---
 kernel/sched/sched.h |  4 +---
 2 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..634ba6637824 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4250,10 +4250,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
 
/*
 * When bandwidth control is enabled, cfs might have been removed
-* because of a parent been throttled but cfs->nr_running > 1. Try to
-* add it unconditionnally.
+* because of a parent been throttled. We'll add it later (with
+* complete branch up to se->on_rq/cfs_eq->on_list) in
+* tg_unthrottle_up() and unthrottle_cfs_rq().
 */
-   if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
+   if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
 
if (cfs_rq->nr_running == 1)
@@ -4859,6 +4860,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+   // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless 
throttled_hierarchy)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
 
@@ -4896,17 +4898,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
add_nr_running(rq, task_delta);
 
 unthrottle_throttle:
-   /*
-* The cfs_rq_throttled() breaks in the above iteration can result in
-* incomplete leaf list maintenance, resulting in triggering the
-* assertion below.
-*/
-   for_each_sched_entity(se) {
-   cfs_rq = cfs_rq_of(se);
-
-   if (list_add_leaf_cfs_rq(cfs_rq))
-   break;
-   }
 
assert_list_leaf_cfs_rq(rq);
 
@@ -5518,6 +5509,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+   // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless 
throttled_hierarchy)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
 
@@ -5544,13 +5536,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
-
-   /*
-* One parent has been throttled and cfs_rq removed from the
-* list. Add it back to not break the leaf list.
-*/
-   if (throttled_hierarchy(cfs_rq))
-   list_add_leaf_cfs_rq(cfs_rq);
}
 
/* At this point se is NULL and we are at root level*/
@@ -5574,20 +5559,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
update_overutilized_status(rq);
 
 enqueue_throttle:
-   if (cfs_bandwidth_used()) {
-   /*
-* When bandwidth control is enabled; the cfs_rq_throttled()
-* breaks in the above iteration can result in incomplete
-* leaf list maintenance, resulting in triggering the assertion
-* below.
-*/
-   for_each_sched_entity(se) {
-   cfs_rq = cfs_rq_of(se);
-
-   if (list_add_leaf_cfs_rq(cfs_rq))
-   break;
-   }
-   }
 

Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller

2021-02-25 Thread Michal Koutný
On Wed, Feb 24, 2021 at 08:57:36PM -0800, Vipin Sharma  
wrote:
> This function is meant for hot unplug functionality too.
Then I'm wondering if the current form is sufficient, i.e. the generic
controller can hardly implement preemption, but possibly it should
prevent any additional charges of the resource. (Or this can be
implemented in the other subsystem and explained in the
misc_cg_set_capacity() docs.)
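
(For reference, with the capacity set to 0 the existing check in
misc_cg_try_charge() already rejects any new charge -- roughly:

	if (new_usage > res->max ||
	    new_usage > misc_res_capacity[type])	/* capacity 0 => true for any amount > 0 */
		return -EBUSY;

so "prevent any additional charges" would mostly mean documenting this
behaviour.)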

> Just to be on the same page are you talking about adding an events file
> like in pids?
Actually, I meant just the kernel log message. As it's the simpler part
and even pid events have some inconsistencies wrt hierarchical
reporting.

> However, if I take reference at the first charge and remove reference at
> last uncharge then I can keep the ref count in correct sync.
I see now how it works. I still find it a bit complex. What about making
misc_cg an input parameter and making it the caller's responsibility to
keep a reference? (Perhaps with helpers for the most common case.)
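
I.e. a sketch of the suggested calling convention (on the SEV side):

	sev->misc_cg = get_current_misc_cg();	/* caller takes and holds the reference */
	ret = misc_cg_try_charge(type, sev->misc_cg, 1);
	if (ret) {
		put_misc_cg(sev->misc_cg);
		sev->misc_cg = NULL;
	}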


Thanks,
Michal




Re: [PATCH] mm: memcontrol: fix slub memory accounting

2021-02-23 Thread Michal Koutný
On Tue, Feb 23, 2021 at 05:24:23PM +0800, Muchun Song 
 wrote:
>  mm/slab_common.c | 4 ++--
>  mm/slub.c| 8 
>  2 files changed, 6 insertions(+), 6 deletions(-)
Reviewed-by: Michal Koutný 




Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller

2021-02-23 Thread Michal Koutný
On Thu, Feb 18, 2021 at 11:55:48AM -0800, Vipin Sharma  
wrote:
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> [...]
> +#ifndef CONFIG_KVM_AMD_SEV
> +/*
> + * When this config is not defined, SEV feature is not supported and APIs in
> + * this file are not used but this file still gets compiled into the KVM AMD
> + * module.
I'm not familiar with the layout of KVM/SEV compile targets but wouldn't
it be simpler to exclude the whole svm/sev.c when !CONFIG_KVM_AMD_SEV?

> +++ b/kernel/cgroup/misc.c
> [...]
> +/**
> + * misc_cg_set_capacity() - Set the capacity of the misc cgroup res.
> + * @type: Type of the misc res.
> + * @capacity: Supported capacity of the misc res on the host.
> + *
> + * If capacity is 0 then the charging a misc cgroup fails for that type.
> + *
> + * The caller must serialize invocations on the same resource.
> + *
> + * Context: Process context.
> + * Return:
> + * * %0 - Successfully registered the capacity.
> + * * %-EINVAL - If @type is invalid.
> + * * %-EBUSY - If current usage is more than the capacity.
When is this function supposed to be called? At boot only or is this
meant for some kind of hot unplug functionality too?

> +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg **cg,
> +unsigned int amount)
> [...]
> + new_usage = atomic_add_return(amount, &res->usage);
> + if (new_usage > res->max ||
> + new_usage > misc_res_capacity[type]) {
> + ret = -EBUSY;
I'm not sure the user of this resource accounting will always be able to
interpret EBUSY returned from the depths of the subsystem.
See what's done in the pids controller in order to give some useful
information about why the operation failed.

> + goto err_charge;
> + }
> +
> + // First one to charge gets a reference.
> + if (new_usage == amount)
> + css_get(&cg->css);
1) Use the /* comment */ style.
2) You pin the whole path from task_cg up to root (on the first charge).
That's unnecessary since children reference their parents.
Also why do you get the reference only for the first charger? While it
may work, it seems too convoluted to me.
It'd be worth documenting what the caller can expect wrt the ref count of
the returned misc_cg.

Thanks,
Michal




Re: [RFC 0/2] cgroup: New misc cgroup controller

2021-02-23 Thread Michal Koutný
Hello.

On Thu, Feb 18, 2021 at 11:55:47AM -0800, Vipin Sharma  
wrote:
> This patch is creating a new misc cgroup controller for allocation and
> tracking of resources which are not abstract like other cgroup
> controllers.
Please don't refer to this as "allocation" anywhere, that has a specific
meaning (see Resource Distribution Models in
Documentation/admin-guide/cgroup-v2.rst).

> This controller was initially proposed as encryption_id but after
> the feedbacks, it is now changed to misc cgroup.
> https://lore.kernel.org/lkml/20210108012846.4134815-2-vipi...@google.com/
Interesting generalization. I wonder what else could fit under this as
well. (It resembles pids controller on the cover.)

> Please provide any feedback for this RFC or if it is good for
> merging then I can send a patch for merging.
A new controller is added and exposed with v1 attributes. I'm not convinced
it is desirable to change the frozen v1 controllers' API? (And therefore
promote it as well.)

Beware, bikeshedding. The name is very non-descriptive, potentially
suggesting catch-all semantics. It'd deserve further thought. My idea
would be a limit(s) or counter controller.


My few cents here (and some more in reply to the patch),
Michal


signature.asc
Description: Digital signature


Re: [PATCH 01/20] cgroup: Manual replacement of the deprecated strlcpy() with return values

2021-02-23 Thread Michal Koutný
Hello.

On Mon, Feb 22, 2021 at 04:12:12PM +0100, Romain Perier 
 wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2265,7 +2265,7 @@ int task_cgroup_path(struct task_struct *task, char 
> *buf, size_t buflen)
Actually, this function isn't used at all. So I'd instead propose the
patch below.

-- >8 --
From 4f7e0b9c0412f60e0b0e8b7d1ef6eb2790dca567 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= 
Date: Tue, 23 Feb 2021 17:05:57 +0100
Subject: [PATCH] cgroup: Drop task_cgroup_path()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The function has no current users and it is a remnant from kdbus
enthusiasm era 857a2beb09ab ("cgroup: implement
task_cgroup_path_from_hierarchy()"). Drop it to eliminate unused code.

Suggested-by: Romain Perier 
Signed-off-by: Michal Koutný 
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup/cgroup.c | 39 ---
 2 files changed, 40 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4f2f79de083e..e9c41b15fd4e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -115,7 +115,6 @@ int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, 
struct cftype *cfts);
 int cgroup_rm_cftypes(struct cftype *cfts);
 void cgroup_file_notify(struct cgroup_file *cfile);
 
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 struct pid *pid, struct task_struct *tsk);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c80fe99f85ae..d75ffd461222 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2235,45 +2235,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, 
size_t buflen,
 }
 EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
-/**
- * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
- * @task: target task
- * @buf: the buffer to write the path into
- * @buflen: the length of the buffer
- *
- * Determine @task's cgroup on the first (the one with the lowest non-zero
- * hierarchy_id) cgroup hierarchy and copy its path into @buf.  This
- * function grabs cgroup_mutex and shouldn't be used inside locks used by
- * cgroup controller callbacks.
- *
- * Return value is the same as kernfs_path().
- */
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
-{
-   struct cgroup_root *root;
-   struct cgroup *cgrp;
-   int hierarchy_id = 1;
-   int ret;
-
-   mutex_lock(&cgroup_mutex);
-   spin_lock_irq(&css_set_lock);
-
-   root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
-
-   if (root) {
-   cgrp = task_cgroup_from_root(task, root);
-   ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
-   } else {
-   /* if no hierarchy exists, everyone is in "/" */
-   ret = strlcpy(buf, "/", buflen);
-   }
-
-   spin_unlock_irq(&css_set_lock);
-   mutex_unlock(&cgroup_mutex);
-   return ret;
-}
-EXPORT_SYMBOL_GPL(task_cgroup_path);
-
 /**
  * cgroup_migrate_add_task - add a migration target task to a migration context
  * @task: target task
-- 
2.30.1



signature.asc
Description: Digital signature


Re: [PATCH v3 6/8] mm: memcontrol: switch to rstat

2021-02-18 Thread Michal Koutný
On Tue, Feb 09, 2021 at 11:33:02AM -0500, Johannes Weiner  
wrote:
> Rstat combines the best of both worlds: from the write side, it
> cheaply maintains a queue of cgroups that have pending changes, so
> that the read side can do selective tree aggregation. This way the
> reported stats will always be precise and recent as can be, while the
> aggregation can skip over potentially large numbers of idle cgroups.
IIUC, on the unified hierarchy the rstat tree built from writer
notifications is shared with all controllers. (Currently, it means
memory updates are merged with base stats cpu time updates.) This is
just for reference, to stress the current rstat design.
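
(Condensed, the shared machinery is roughly:

	/* write side, used by any rstat user (base cpu stats, memcg, ...):
	 * link @cgrp and its not-yet-linked ancestors into the per-cpu
	 * updated tree */
	void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);

	/* read side: walk only the linked part of @cgrp's subtree and let
	 * every subsystem flush via its css_rstat_flush() callback */
	void cgroup_rstat_flush(struct cgroup *cgrp);

so one controller's updates make the flush walk visit those cgroups for all
rstat users.)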

> ---
>  include/linux/memcontrol.h |  67 ++-
>  mm/memcontrol.c| 223 +++--
>  2 files changed, 132 insertions(+), 158 deletions(-)
Well done.

Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1

2021-02-18 Thread Michal Koutný
On Wed, Feb 17, 2021 at 03:52:59PM -0500, Johannes Weiner  
wrote:
> It's possible, but I don't think worth the trouble.
You're right. I gave it a deeper look and what would be saved on data,
would be paid in code complexity.

> In this case, we're talking about a relatively small data structure
> and the overhead is per mountpoint.
IIUC, it scales with each mountpoint's number of cgroups. But I still accept
the argument above. Furthermore, this can be changed later.

> The default root group has statically preallocated percpu data before
> and after this patch. See cgroup.c:
I stand corrected, the comment is still valid.

Therefore,
Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v3 8/8] kselftests: cgroup: update kmem test for new vmstat implementation

2021-02-18 Thread Michal Koutný
On Tue, Feb 09, 2021 at 11:33:04AM -0500, Johannes Weiner  
wrote:
> ---
>  tools/testing/selftests/cgroup/test_kmem.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
Reviewed-by: Michal Koutný 



signature.asc
Description: Digital signature


Re: [PATCH v3 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo()

2021-02-17 Thread Michal Koutný
On Tue, Feb 09, 2021 at 11:32:58AM -0500, Johannes Weiner  
wrote:
>  include/linux/memcontrol.h |  8 +---
>  mm/memcontrol.c| 21 +++--
>  2 files changed, 12 insertions(+), 17 deletions(-)
Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v3 3/8] mm: memcontrol: privatize memcg_page_state query functions

2021-02-17 Thread Michal Koutný
On Tue, Feb 09, 2021 at 11:32:59AM -0500, Johannes Weiner  
wrote:
>  include/linux/memcontrol.h | 44 --
>  mm/memcontrol.c| 32 +++
>  2 files changed, 32 insertions(+), 44 deletions(-)
Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v3 1/8] mm: memcontrol: fix cpuhotplug statistics flushing

2021-02-17 Thread Michal Koutný
On Tue, Feb 09, 2021 at 11:32:57AM -0500, Johannes Weiner  
wrote:
>  mm/memcontrol.c | 35 +--
>  1 file changed, 21 insertions(+), 14 deletions(-)
Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1

2021-02-17 Thread Michal Koutný
Hello.

On Tue, Feb 09, 2021 at 11:33:00AM -0500, Johannes Weiner  
wrote:
> @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 
> ss_mask)
>   if (ret)
>   goto destroy_root;
>  
> - ret = rebind_subsystems(root, ss_mask);
> + ret = cgroup_rstat_init(root_cgrp);
Would it make sense to do cgroup_rstat_init() only if there's a subsys
in ss_mask that makes use of rstat?
(On legacy systems there could be individual hierarchy for each
controller so the rstat space can be saved.)
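
(Roughly what I mean -- ss_mask_uses_rstat() being a hypothetical helper that
checks whether any subsystem in the mask implements css_rstat_flush():

	if (ss_mask_uses_rstat(ss_mask)) {
		ret = cgroup_rstat_init(root_cgrp);
		if (ret)
			goto destroy_root;
	}
)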

> @@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup 
> *parent, const char *name,
>   if (ret)
>   goto out_free_cgrp;
>  
> - if (cgroup_on_dfl(parent)) {
> - ret = cgroup_rstat_init(cgrp);
> - if (ret)
> - goto out_cancel_ref;
> - }
> + ret = cgroup_rstat_init(cgrp);
And here do cgroup_rstat_init() only when parent has it.

> @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
>  
>   for_each_possible_cpu(cpu)
>   raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> -
> - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
>  }
Regardless of the suggestion above, this removal obsoletes the comment
cgroup_rstat_init:

 int cpu;
 
-/* the root cgrp has rstat_cpu preallocated */
 if (!cgrp->rstat_cpu) {
 cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);

Regards,
Michal


signature.asc
Description: Digital signature


Re: [PATCH] mm: memcg/slab: optimize objcg stock draining

2021-01-20 Thread Michal Koutný
On Tue, Jan 05, 2021 at 08:22:39PM -0800, Roman Gushchin  wrote:
> Imran Khan reported a regression in hackbench results caused by the
> commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages"). The regression is noticeable in the case of
> a consequent allocation of several relatively large slab objects,
> e.g. skb's. As soon as the amount of stocked bytes exceeds PAGE_SIZE,
> drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
> to a number of atomic operations in page_counter_uncharge().
> 
> The corresponding call graph is below (provided by Imran Khan):
>   |__alloc_skb
>   ||
>   ||__kmalloc_reserve.isra.61
>   |||
>   |||__kmalloc_node_track_caller
>   ||||
>   ||||slab_pre_alloc_hook.constprop.88
>   ||| obj_cgroup_charge
>   |||||
>   |||||__memcg_kmem_charge
>   ||||||
>   ||||||page_counter_try_charge
>   |||||
>   |||||refill_obj_stock
>   ||||||
>   ||||||drain_obj_stock.isra.68   <--- draining old memcg
>   |||||||
>   |||||||__memcg_kmem_uncharge
>   ||||||||
>   ||||||||page_counter_uncharge
>   |||||||||
>   |||||||||page_counter_cancel
> [...]
> - page_counter_uncharge(&memcg->memory, nr_pages);
> - if (do_memsw_account())
> - page_counter_uncharge(&memcg->memsw, nr_pages);
> + refill_stock(memcg, nr_pages);
I noticed that refill_stock(memcg,...) switches the local stock to
memcg.

In this particular call chain, the uncharged memcg is the old memcg of
stock->cached_objcg. The refill_stock() then may switch stock->cached to
the old memcg too. If the patch leads to better performance, then the
switch probably doesn't happen at this moment (and I guess
stock->cached_objcg and stock->cached can be independent to some extent,
so the old memcg in one needn't be the old in the latter).
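
(The switching I refer to is this part of refill_stock(), abridged:

	stock = this_cpu_ptr(&memcg_stock);
	if (stock->cached != memcg) { /* reset if necessary */
		drain_stock(stock);
		...
		stock->cached = memcg;
	}
	stock->nr_pages += nr_pages;
	...
)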

In conclusion
Reviewed-by: Michal Koutný 


Michal


signature.asc
Description: Digital signature


Re: [PATCH v3] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-15 Thread Michal Koutný
On Fri, Jan 15, 2021 at 05:37:17PM +0800, Chen Zhou  
wrote:
> [...]
>  kernel/cgroup/cgroup-v1.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Michal Koutný 


signature.asc
Description: Digital signature


Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-15 Thread Michal Koutný
On Fri, Jan 15, 2021 at 09:55:43AM +0800, chenzhou  
wrote:
> Yeah, this will select all enabled controllers, but which doesn't the 
> behavior we want.
I see what the issue is now.

> See above. Just the mount behavior isn't what we what.
I agree this is a bug and I find your approach correct

Reviewed-by: Michal Koutný 

> The behavior was changed since commit f5dfb5315d34 ("cgroup: take
> options parsing into ->parse_monolithic()"), will add this as Fixes.
Thanks.

Michal


signature.asc
Description: Digital signature


Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-14 Thread Michal Koutný
On Thu, Jan 14, 2021 at 10:08:19PM +0800, chenzhou  
wrote:
> In this case, at the beginning of function check_cgroupfs_options(), the mask
> ctx->subsys_mask will be 0. And if we mount without 'none' and 'name=' 
> options,
> then in check_cgroupfs_options(), the flag ctx->all_ss will be set, that is, 
> select all the subsystems.
But even then, the line
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup-v1.c?h=v5.11-rc3#n1012
would select only 'enabled' controllers, wouldn't it?

It's possible I missed something but if this means that cgroup_no_v1=
doesn't hold to its expectations, I'd suggest adding proper Fixes: tag
to the patch.

Thanks,
Michal


signature.asc
Description: Digital signature


Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-14 Thread Michal Koutný
Hello Chen.

On Fri, Dec 18, 2020 at 02:17:55PM +0800, Chen Zhou  
wrote:
> When mounting a cgroup hierarchy with disabled controller in cgroup v1,
> all available controllers will be attached.
Not sure if I understand the situation -- have you observed a v1
controller attached to a hierarchy while specifying cgroup_no_v1= kernel
cmdline arg?

AFAICS, the disabled controllers are honored thanks to
check_cgroupfs_options().

Michal


signature.asc
Description: Digital signature


Re: [PATCH] cgroup: Remove unnecessary call to strstrip()

2021-01-14 Thread Michal Koutný
Hello.

On Sun, Jan 03, 2021 at 02:50:01AM +, Hao Lee  
wrote:
> The string buf will be stripped in cgroup_procs_write_start() before it
> is converted to int, so remove this unnecessary call to strstrip().
Good catch, Hao.

Perhaps the code can then be simplified a bit

-- >8 --
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= 
Date: Thu, 14 Jan 2021 13:23:39 +0100
Subject: [PATCH] cgroup: cgroup.{procs,threads} factor out common parts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The functions cgroup_threads_start and cgroup_procs_start are almost
identical. In order to reduce duplication, factor out the common code in
similar fashion we already do for other threadgroup/task functions. No
functional changes are intended.

Suggested-by: Hao Lee 
Signed-off-by: Michal Koutný 
---
 kernel/cgroup/cgroup.c | 55 +++---
 1 file changed, 14 insertions(+), 41 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 613845769103c..f6279df507f4b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4726,8 +4726,8 @@ static int cgroup_attach_permissions(struct cgroup 
*src_cgrp,
return ret;
 }
 
-static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
+   bool threadgroup)
 {
struct cgroup *src_cgrp, *dst_cgrp;
struct task_struct *task;
@@ -4738,7 +4738,7 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file 
*of,
if (!dst_cgrp)
return -ENODEV;
 
-   task = cgroup_procs_write_start(buf, true, &locked);
+   task = cgroup_procs_write_start(buf, threadgroup, &locked);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
@@ -4748,19 +4748,26 @@ static ssize_t cgroup_procs_write(struct 
kernfs_open_file *of,
src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
spin_unlock_irq(&css_set_lock);
 
+   /* process and thread migrations follow same delegation rule */
ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
-   of->file->f_path.dentry->d_sb, true);
+   of->file->f_path.dentry->d_sb, 
threadgroup);
if (ret)
goto out_finish;
 
-   ret = cgroup_attach_task(dst_cgrp, task, true);
+   ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
 
 out_finish:
cgroup_procs_write_finish(task, locked);
 out_unlock:
cgroup_kn_unlock(of->kn);
 
-   return ret ?: nbytes;
+   return ret;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+   return __cgroup_procs_write(of, buf, true) ?: nbytes;
 }
 
 static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
@@ -4771,41 +4778,7 @@ static void *cgroup_threads_start(struct seq_file *s, 
loff_t *pos)
 static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
 {
-   struct cgroup *src_cgrp, *dst_cgrp;
-   struct task_struct *task;
-   ssize_t ret;
-   bool locked;
-
-   buf = strstrip(buf);
-
-   dst_cgrp = cgroup_kn_lock_live(of->kn, false);
-   if (!dst_cgrp)
-   return -ENODEV;
-
-   task = cgroup_procs_write_start(buf, false, &locked);
-   ret = PTR_ERR_OR_ZERO(task);
-   if (ret)
-   goto out_unlock;
-
-   /* find the source cgroup */
-   spin_lock_irq(&css_set_lock);
-   src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-   spin_unlock_irq(&css_set_lock);
-
-   /* thread migrations follow the cgroup.procs delegation rule */
-   ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
-   of->file->f_path.dentry->d_sb, false);
-   if (ret)
-   goto out_finish;
-
-   ret = cgroup_attach_task(dst_cgrp, task, false);
-
-out_finish:
-   cgroup_procs_write_finish(task, locked);
-out_unlock:
-   cgroup_kn_unlock(of->kn);
-
-   return ret ?: nbytes;
+   return __cgroup_procs_write(of, buf, false) ?: nbytes;
 }
 
 /* cgroup core interface files for the default hierarchy */
-- 
2.29.2



Re: [PATCH] mm: memcontrol: prevent starvation when writing memory.high

2021-01-13 Thread Michal Koutný
On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner  
wrote:
> - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -  GFP_KERNEL, true);
> + try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +  GFP_KERNEL, true);
Although I was also initially confused by throwing 'reclaimed' info
away, the patch makes sense to me given the reasoning.

It is
Reviewed-by: Michal Koutný 

As for the discussed unsuccessful retries, I'd keep it a separate change.

Regards,
Michal


signature.asc
Description: Digital signature


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-20 Thread Michal Koutný
Hi.

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt  
wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely free'd.)

> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the
necessary counters in both directions using the statistics (or some
complementarity, without fine grained counters removal).

I didn't check all the charging/uncharging places, these are just my 2
cents to the issue.

Michal


signature.asc
Description: Digital signature


Re: [PATCH v2] mm: memcg: link page counters to root if use_hierarchy is false

2020-10-29 Thread Michal Koutný
Hi.

On Mon, Oct 26, 2020 at 04:13:26PM -0700, Roman Gushchin  wrote:
> Please note, that in the non-hierarchical mode all objcgs are always
> reparented to the root memory cgroup, even if the hierarchy has more
> than 1 level. This patch doesn't change it.
> 
> The patch also doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
I agree with the patch and you can add
Reviewed-by: Michal Koutný 

However, it effectively switches any users of root.use_hierarchy=0 (if there
are any, watching the counters of root memcg) into root.use_hierarchy=1.
So I'd show them the warning even with a single level of cgroups, i.e.
add this hunk

@@ -5356,12 +5356,11 @@
page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
/*
-* Deeper hierachy with use_hierarchy == false doesn't make
+* Hierachy with use_hierarchy == false doesn't make
 * much sense so let cgroup subsystem know about this
 * unfortunate state in our controller.
 */
-   if (parent != root_mem_cgroup)
-   memory_cgrp_subsys.broken_hierarchy = true;
+   memory_cgrp_subsys.broken_hierarchy = true;
}
 
/* The following stuff does not apply to the root */

What do you think?

Michal


signature.asc
Description: Digital signature


Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Michal Koutný
Hi.

On Tue, Oct 20, 2020 at 06:52:08AM +0100, Richard Palethorpe 
 wrote:
> I don't think that is relevant as we get the memcg from objcg->memcg
> which is set during reparenting. I suppose however, we can determine if
> the objcg was reparented by inspecting memcg->objcg.
+1

> If we just check use_hierarchy then objects directly charged to the
> memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
> better to check if it was reparented and if use_hierarchy=0.
I think (I had to make a table) the yielded condition would be:

if ((memcg->use_hierarchy && reparented) || (!mem_cgroup_is_root(memcg) && !reparented))
 __memcg_kmem_uncharge(memcg, nr_pages);

(I admit it's not very readable.)
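
Spelled out with intermediate variables (equivalent sketch):

	bool charged_in_ancestors = memcg->use_hierarchy && reparented;
	bool charged_here = !mem_cgroup_is_root(memcg) && !reparented;

	if (charged_in_ancestors || charged_here)
		__memcg_kmem_uncharge(memcg, nr_pages);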


Michal


signature.asc
Description: Digital signature


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Michal Koutný
Hi.

On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin  wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/

> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...

> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.


> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root
counters and keeping the traversal as is.

> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)

Michal


signature.asc
Description: Digital signature


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe 
 wrote:
> I'm don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was
originally (not)charged in the root memcg or it was only reparented to
it. (I originally considered only the genuine root objcgs.)

In this light, homogenous charing to root memcg looks really simpler but
I wonder what other surprises it brings about.

Michal


signature.asc
Description: Digital signature


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner  
wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hierarchy=0 (which is the case here as Richard
wrote). The reparenting is hence somewhat incompatible with
new_parent.use_hierarchy=0 :-/
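
(The propagation being the parent walk in the page_counter code, roughly:

	/* page_counter_charge(), simplified; with use_hierarchy==0 the child
	 * counters are simply not linked up to the root's counters */
	struct page_counter *c;

	for (c = counter; c; c = c->parent)
		atomic_long_add(nr_pages, &c->usage);
)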

> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding root memcg from
charging is that users are spared from the charging overhead when memcg
tree is not created.  (Actually, I thought that was the reason for this
exception.)

Michal


signature.asc
Description: Digital signature


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
Hello.

On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe 
 wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up the
hierarchy, including root memcg, so if they're later uncharged in root
after reparenting, it should still break even. (Or did I miss some stock
imbalance?)

(But the patch seems justifiable to me as objects (not)charged directly to
root memcg may be incorrectly uncharged.)

Thanks,
Michal


signature.asc
Description: Digital signature


Re: [PATCH v4 4/4] mm: convert page kmemcg type to a page memcg flag

2020-10-02 Thread Michal Koutný
On Thu, Oct 01, 2020 at 01:27:13PM -0400, Johannes Weiner  
wrote:
> The activation code is the only path where page migration is not
> excluded. Because unlike with page state statistics, we don't really
> mind a race when counting an activation event.
Thanks for the explanation. I see why the accessor trio is justified.

> I do think there is a bug, though: mem_cgroup_move_account() should
> use WRITE_ONCE() on page->mem_cgroup.
If this were a bug, wouldn't the proper approach be an
rcu_assign_pointer()/rcu_dereference() pair?

Michal


signature.asc
Description: Digital signature


Re: [PATCH v4 4/4] mm: convert page kmemcg type to a page memcg flag

2020-10-01 Thread Michal Koutný
Hi.

On Wed, Sep 30, 2020 at 05:27:10PM -0700, Roman Gushchin  wrote:
> @@ -369,8 +371,12 @@ enum page_memcg_data_flags {
>   */
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
> + unsigned long memcg_data = page->memcg_data;
> +
>   VM_BUG_ON_PAGE(PageSlab(page), page);
> - return (struct mem_cgroup *)page->memcg_data;
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> + return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
Shouldn't this change also go into page_memcg_rcu()? (I don't think the
current single user (workingset_activation()) would pass a non-slab
kernel page, but for consistency's sake.)

Alternatively, I'm wondering why (in its single use) page_memcg_rcu() is a
separate function from page_memcg() (cross-memcg page migration?).

Regards,
Michal


signature.asc
Description: Digital signature


Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

2020-08-12 Thread Michal Koutný
On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin  wrote:
> If we'll limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create trashing in init.slice, followed by OOM.
> I struggle to see how it makes the life of a user better?
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources and help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live
memcgs.

When the issue are the dying memcgs (c), they were effectively released
by their creator but are pinned by whatever remained after their life
(LRU pages, slab->obj_cgroups). Since these pins were created _from
within_ such a child (c), they're most readily removable by reclaiming
(hierarchically) close to c. (It'd be achievable by limiting the lowest
common ancestor of manager and its product (typically root) but that is
more clumsy and less effective.)

This is the reasoning that justifies the remote charge.

Thanks!
Michal


signature.asc
Description: Digital signature


Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

2020-08-11 Thread Michal Koutný
On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin  wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most
common case).

> 2) OOMing or failing some random memory allocations is a bad way
>to "communicate" a memory shortage to the daemon.
>What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...
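(For instance, "echo 100 >$CGRP/cgroup.max.descendants" -- value illustrative --
caps the number of live descendant cgroups under $CGRP.)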

>(including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?

>OOMing the daemon or returning -ENOMEM to some random syscalls
>will not help us to reach the goal and likely will bring a bad
>experience to a user.
If we reach the situation when memory for cgroup operations is tight,
it'll disappoint the user either way.
My premise is that a running workload is more valuable than the
accompanying manager.

> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various
kernel threads doing deferred work. Here it's promoted to user processes
managing cgroups.

> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
> 
> ` root or delegated root
>` manager-parent-cgroup-with-a-limit
>  ` manager-cgroup (systemd, docker, ...)
>` [aggregation group(s)]
>  ` job-group-1
>  ` ...
>  ` job-group-n
If the charge goes to the parent of created cgroup (job-cgroup-i here),
then the layer adds nothing. Am I missing something?

> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate the controllers on the unified hierarchy, which is
fine IMO.) Then the order of enabling controllers on a subtree (e.g.
cpu,memory vs memory,cpu) by the manager would yield different charging.
This seems wrong^W confusing to me. 


Thanks,
Michal


signature.asc
Description: Digital signature


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-08-11 Thread Michal Koutný
Hi Shakeel.

On Tue, Jul 07, 2020 at 10:02:50AM -0700, Shakeel Butt  
wrote:
> > Well, I was talkingg about memory.low. It is not meant only to protect
> > from the global reclaim. It can be used for balancing memory reclaim
> > from _any_ external memory pressure source. So it is somehow related to
> > the usecase you have mentioned.
> >
> 
> For the uswapd use-case, I am not concerned about the external memory
> pressure source but the application hitting its own memory.high limit
> and getting throttled.
FTR, you can transform own memory.high into "external" pressure with a
hierarchy such as

  limit-group   memory.high=N+margin memory.low=0
  `- latency-sensitive-groupmemory.low=N
  `- regular-group  memory.low=0

Would that ensure the latency targets?
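
(Illustratively -- paths are examples, N/MARGIN are placeholders, and it
assumes the memory controller is already enabled for this subtree:

	cd /sys/fs/cgroup/app		# the application's delegated subtree
	mkdir limit-group
	echo +memory >limit-group/cgroup.subtree_control
	mkdir limit-group/latency-sensitive-group limit-group/regular-group
	echo $((N + MARGIN)) >limit-group/memory.high
	echo $N >limit-group/latency-sensitive-group/memory.low
	# ...and move the respective workloads into the two leaf groups
	# via their cgroup.procs files.
)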

Michal


signature.asc
Description: Digital signature


Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

2020-08-11 Thread Michal Koutný
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin  wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process
constraint, in the root too)

` root or delegated root
  ` manager-cgroup (systemd, docker, ...)
  ` [aggregation group(s)]
` job-group-1
` ...
` job-group-n

> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
No dispute about this in either approaches.

> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation.
Or more generally, any container/cgroup manager should be charged.
Consider a leak where it wouldn't remove spent cgroups; IMO the effect
(throttling, reclaim) should be exercised on such a culprit.

I don't think the existing workload (job-group-i above) should
unnecessarily suffer when only manager is acting up. Is that different
from your idea?

> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in completely different cgroup. In this particular case there are tons of 
> other
> ways how a task from C00 can hurt C1.
I agree with that.


If I haven't overlooked anything, this should be the first case where
cgroup-related structures are accounted (please correct me).
So this is setting a precedent, should others prove useful to account
in the future too. I'm thinking about cpu_cgroup_css_alloc(), which can
also allocate a lot (with a big CPU count). The current approach would lead
to situations where matching cpu and memory csses needn't exist, and that
would need special handling.


> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
Despite my concerns, I don't think this is fundamental and can't be
changed later so it doesn't prevent the inclusion in 5.9 RC1.

Regards,
Michal


signature.asc
Description: Digital signature


Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

2020-07-29 Thread Michal Koutný
Hello.

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin  wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's significant amount.

> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?

Should the charge go the cgroup _that is creating_ the new memcg?

One reason is that there are the throttling mechanisms for memory limits
and those are better exercised when the actor and its memory artefact
are the same cgroup, aren't they?

The second reason is based on the example in Delegation Containment
(Documentation/admin-guide/cgroup-v2.rst)

> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
> 
>   ~~~~~~~~~~~~~ - C0 - C00
>   ~ cgroup    ~      \ C01
>   ~ hierarchy ~
>   ~~~~~~~~~~~~~ - C1 - C10

Thanks to permissions, a task running in C0 creating a cgroup in C1 would
deplete C1's supply, victimizing tasks inside C1.

Thanks,
Michal


signature.asc
Description: Digital signature


[PATCH 3/3] docs: cgroup: No special handling of unpopulated memcgs

2020-07-29 Thread Michal Koutný
The current kernel doesn't treat unpopulated cgroups in any special way
regarding reclaim protection. Furthermore, this wasn't the case even when
this was introduced in
bf8d5d52ffe89 ("memcg: introduce memory.min")
Drop the incorrect documentation. (Implementation taking into account
the inner-node constraint may be added later.)

Signed-off-by: Michal Koutný 
---
 Documentation/admin-guide/cgroup-v2.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 47f9f056e66f..3d62922c4499 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1123,9 +1123,6 @@ PAGE_SIZE multiple when read back.
Putting more memory than generally available under this
protection is discouraged and may lead to constant OOMs.
 
-   If a memory cgroup is not populated with processes,
-   its memory.min is ignored.
-
   memory.low
A read-write single value file which exists on non-root
cgroups.  The default is "0".
-- 
2.27.0



[PATCH 0/3] Memory reclaim documentation fixes

2020-07-29 Thread Michal Koutný
Hi,
I think the reclaim target is a concept that is not just an
implementation detail and hence it should be documented how it applies
to protection configuration (the first patch). The second patch is a "best
practice" bit of information; it may be squashed with the first one. The
last patch just makes docs indefinite until the idea is implemented.


Michal Koutný (3):
  docs: cgroup: Explain reclaim protection target
  docs: cgroup: Note about sibling relative reclaim protection
  docs: cgroup: No special handling of unpopulated memcgs

 Documentation/admin-guide/cgroup-v2.rst | 31 -
 1 file changed, 25 insertions(+), 6 deletions(-)

-- 
2.27.0



[PATCH 1/3] docs: cgroup: Explain reclaim protection target

2020-07-29 Thread Michal Koutný
Signed-off-by: Michal Koutný 
---
 Documentation/admin-guide/cgroup-v2.rst | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index d09471aa7443..94bdff4f9e09 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -47,7 +47,8 @@ v1 is available under 
:ref:`Documentation/admin-guide/cgroup-v1/index.rst 

[PATCH 2/3] docs: cgroup: Note about sibling relative reclaim protection

2020-07-29 Thread Michal Koutný
Signed-off-by: Michal Koutný 
---
 Documentation/admin-guide/cgroup-v2.rst | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 94bdff4f9e09..47f9f056e66f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1467,6 +1467,10 @@ targets ancestors of A, the effective protection of B is 
capped by the
 protection value configured for A (and any other intermediate ancestors between
 A and the target).
 
+To express indifference about relative sibling protection, it is suggested to
+use memory_recursiveprot. Configuring all descendants of a parent with finite
+protection to "max" works but it may unnecessarily skew memory.events:low
+field.
 
 Memory Ownership
 
-- 
2.27.0



[PATCH] /proc/PID/smaps: Consistent whitespace output format

2020-07-28 Thread Michal Koutný
The keys in smaps output are padded to fixed width with spaces.
All except for THPeligible that uses tabs (only since
commit c06306696f83 ("mm: thp: fix false negative of shmem vma's THP
eligibility")).
Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)

Signed-off-by: Michal Koutný 
---
 fs/proc/task_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dbda4499a859..5066b0251ed8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -786,7 +786,7 @@ static void __show_smap(struct seq_file *m, const struct 
mem_size_stats *mss,
SEQ_PUT_DEC(" kB\nLazyFree:   ", mss->lazyfree);
SEQ_PUT_DEC(" kB\nAnonHugePages:  ", mss->anonymous_thp);
SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
-   SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp);
+   SEQ_PUT_DEC(" kB\nFilePmdMapped:  ", mss->file_thp);
SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
  mss->private_hugetlb >> 10, 7);
@@ -816,7 +816,7 @@ static int show_smap(struct seq_file *m, void *v)
 
__show_smap(m, , false);
 
-   seq_printf(m, "THPeligible: %d\n",
+   seq_printf(m, "THPeligible:%d\n",
   transparent_hugepage_enabled(vma));
 
if (arch_pkeys_enabled())
-- 
2.27.0



Re: [PATCH v4 2/3] nsproxy: attach to namespaces via pidfds

2020-06-24 Thread Michal Koutný
On Wed, Jun 24, 2020 at 01:54:56PM +0200, Christian Brauner 
 wrote:
> Yep, I already have a fix for this in my tree based on a previous
> report from LTP.
Perfect. (Sorry for the noise then.)

Thanks,
Michal


signature.asc
Description: Digital signature


Re: [PATCH v4 2/3] nsproxy: attach to namespaces via pidfds

2020-06-24 Thread Michal Koutný
Hi.

On Tue, May 05, 2020 at 04:04:31PM +0200, Christian Brauner 
 wrote:
> -SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +SYSCALL_DEFINE2(setns, int, fd, int, flags)
> [...]
> - file = proc_ns_fget(fd);
> - if (IS_ERR(file))
> - return PTR_ERR(file);
> + int err = 0;
>  
> - err = -EINVAL;
> - ns = get_proc_ns(file_inode(file));
> - if (nstype && (ns->ops->type != nstype))
> + file = fget(fd);
> + if (!file)
> + return -EBADF;
> +
> + if (proc_ns_file(file)) {
> + ns = get_proc_ns(file_inode(file));
> + if (flags && (ns->ops->type != flags))
> + err = -EINVAL;
> + flags = ns->ops->type;
> + } else if (pidfd_pid(file)) {
> + err = check_setns_flags(flags);
> + } else {
> + err = -EBADF;
> + }
> + if (err)
>   goto out;
>  
> - err = prepare_nsset(ns->ops->type, );
> + err = prepare_nsset(flags, );
>   if (err)
>   goto out;
This modification changed the returned error when a valid file
descriptor is passed but it doesn't represent a namespace (nor a pidfd).
The error is now EBADF although originally, and per the man page, it
was/should be EINVAL.

A change like the one below would restore it; however, I see it may be less
consistent with other pidfd calls(?), in which case I'd suggest updating the
man page to capture this.

--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -531,7 +531,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
} else if (!IS_ERR(pidfd_pid(file))) {
err = check_setns_flags(flags);
} else {
-   err = -EBADF;
+   err = -EINVAL;
}
if (err)
goto out;

I noticed this breaks systemd self tests [1].

Regards,
Michal


[1] 
https://github.com/systemd/systemd/blob/a1ba8c5b71164665ccb53c9cec384e5eef7d3689/src/test/test-seccomp.c#L246


signature.asc
Description: Digital signature


Re: [PATCH 3/6] mm, memcg: Prevent memory.low load/store tearing

2020-06-12 Thread Michal Koutný
Hello.

I see suspicious asymmetry, in the current mainline:
>   WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage,
>   READ_ONCE(memcg->memory.min),
>   READ_ONCE(parent->memory.emin),
>   atomic_long_read(&parent->memory.children_min_usage)));
>
>   WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
>   memcg->memory.low, READ_ONCE(parent->memory.elow),
>   atomic_long_read(&parent->memory.children_low_usage)));

On Thu, Mar 12, 2020 at 05:33:01PM +, Chris Down  
wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aca2964ea494..c85a304fa4a1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6262,7 +6262,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
> mem_cgroup *root,
>   return MEMCG_PROT_NONE;
>  
>   emin = memcg->memory.min;
> - elow = memcg->memory.low;
> + elow = READ_ONCE(memcg->memory.low);
>  
>   parent = parent_mem_cgroup(memcg);
>   /* No parent means a non-hierarchical mode on v1 memcg */
> @@ -6291,7 +6291,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
> mem_cgroup *root,
>   if (elow && parent_elow) {
>   unsigned long low_usage, siblings_low_usage;
>  
> - low_usage = min(usage, memcg->memory.low);
> + low_usage = min(usage, READ_ONCE(memcg->memory.low));
>   siblings_low_usage = atomic_long_read(
>   &parent->memory.children_low_usage);
Is it possible that these hunks were lost during rebase/merge?

IMHO it should apply as:

-- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6428,7 +6428,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
atomic_long_read(&parent->memory.children_min_usage)));

WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
-   memcg->memory.low, READ_ONCE(parent->memory.elow),
+   READ_ONCE(memcg->memory.low),
+   READ_ONCE(parent->memory.elow),
atomic_long_read(&parent->memory.children_low_usage)));

 out:


Michal


signature.asc
Description: Digital signature


Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

2019-10-07 Thread Michal Koutný
On Fri, Oct 04, 2019 at 03:11:04PM -0700, Roman Gushchin  wrote:
> An inode which is getting dirty for the first time is associated
> with the wb structure (look at __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but generally
> stays associated up to the end of life of the inode structure.
What about dissociating the wb structure from the charged cgroup after
the particular writeback finished? (I understand from your description
that wb structure outlives the dirtier and is kept due to other inode
(read) users, not sure if that's correct assumption.)

Michal


signature.asc
Description: Digital signature


[PATCH 3/5] selftests: cgroup: Simplify task self migration

2019-10-04 Thread Michal Koutný
Simplify task migration by being oblivious about its PID during
migration. This allows to easily migrate individual threads as well.
This change brings no functional change and prepares grounds for thread
granularity migrating tests.

Signed-off-by: Michal Koutný 
---
 tools/testing/selftests/cgroup/cgroup_util.c  | 16 +++-
 tools/testing/selftests/cgroup/cgroup_util.h  |  4 +++-
 tools/testing/selftests/cgroup/test_freezer.c |  2 +-
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c 
b/tools/testing/selftests/cgroup/cgroup_util.c
index bdb69599c4bd..f6573eac1365 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -282,10 +282,12 @@ int cg_enter(const char *cgroup, int pid)
 
 int cg_enter_current(const char *cgroup)
 {
-   char pidbuf[64];
+   return cg_write(cgroup, "cgroup.procs", "0");
+}
 
-   snprintf(pidbuf, sizeof(pidbuf), "%d", getpid());
-   return cg_write(cgroup, "cgroup.procs", pidbuf);
+int cg_enter_current_thread(const char *cgroup)
+{
+   return cg_write(cgroup, "cgroup.threads", "0");
 }
 
 int cg_run(const char *cgroup,
@@ -410,11 +412,15 @@ int set_oom_adj_score(int pid, int score)
return 0;
 }
 
-char proc_read_text(int pid, const char *item, char *buf, size_t size)
+ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, 
size_t size)
 {
char path[PATH_MAX];
 
-   snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
+   if (!pid)
+   snprintf(path, sizeof(path), "/proc/%s/%s",
+thread ? "thread-self" : "self", item);
+   else
+   snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
 
return read_text(path, buf, size);
 }
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h 
b/tools/testing/selftests/cgroup/cgroup_util.h
index c72f28046bfa..27ff21d82af1 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include 
 #include 
 
 #define PAGE_SIZE 4096
@@ -35,6 +36,7 @@ extern int cg_run(const char *cgroup,
  void *arg);
 extern int cg_enter(const char *cgroup, int pid);
 extern int cg_enter_current(const char *cgroup);
+extern int cg_enter_current_thread(const char *cgroup);
 extern int cg_run_nowait(const char *cgroup,
 int (*fn)(const char *cgroup, void *arg),
 void *arg);
@@ -45,4 +47,4 @@ extern int is_swap_enabled(void);
 extern int set_oom_adj_score(int pid, int score);
 extern int cg_wait_for_proc_count(const char *cgroup, int count);
 extern int cg_killall(const char *cgroup);
-extern char proc_read_text(int pid, const char *item, char *buf, size_t size);
+extern ssize_t proc_read_text(int pid, bool thread, const char *item, char 
*buf, size_t size);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c 
b/tools/testing/selftests/cgroup/test_freezer.c
index 8219a30853d2..5896e35168cd 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -648,7 +648,7 @@ static int proc_check_stopped(int pid)
char buf[PAGE_SIZE];
int len;
 
-   len = proc_read_text(pid, "stat", buf, sizeof(buf));
+   len = proc_read_text(pid, 0, "stat", buf, sizeof(buf));
if (len == -1) {
debug("Can't get %d stat\n", pid);
return -1;
-- 
2.21.0



[PATCH 4/5] selftests: cgroup: Add task migration tests

2019-10-04 Thread Michal Koutný
Add two new tests that verify that thread and threadgroup migrations
work as expected.

Signed-off-by: Michal Koutný 
---
 tools/testing/selftests/cgroup/Makefile  |   2 +-
 tools/testing/selftests/cgroup/cgroup_util.c |  26 
 tools/testing/selftests/cgroup/cgroup_util.h |   2 +
 tools/testing/selftests/cgroup/test_core.c   | 146 +++
 4 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/Makefile 
b/tools/testing/selftests/cgroup/Makefile
index 8d369b6a2069..1c9179400be0 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-CFLAGS += -Wall
+CFLAGS += -Wall -pthread
 
 all:
 
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c 
b/tools/testing/selftests/cgroup/cgroup_util.c
index f6573eac1365..8f7131dcf1ff 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -158,6 +158,22 @@ long cg_read_key_long(const char *cgroup, const char 
*control, const char *key)
return atol(ptr + strlen(key));
 }
 
+long cg_read_lc(const char *cgroup, const char *control)
+{
+   char buf[PAGE_SIZE];
+   const char delim[] = "\n";
+   char *line;
+   long cnt = 0;
+
+   if (cg_read(cgroup, control, buf, sizeof(buf)))
+   return -1;
+
+   for (line = strtok(buf, delim); line; line = strtok(NULL, delim))
+   cnt++;
+
+   return cnt;
+}
+
 int cg_write(const char *cgroup, const char *control, char *buf)
 {
char path[PATH_MAX];
@@ -424,3 +440,13 @@ ssize_t proc_read_text(int pid, bool thread, const char 
*item, char *buf, size_t
 
return read_text(path, buf, size);
 }
+
+int proc_read_strstr(int pid, bool thread, const char *item, const char 
*needle)
+{
+   char buf[PAGE_SIZE];
+
+   if (proc_read_text(pid, thread, item, buf, sizeof(buf)) < 0)
+   return -1;
+
+   return strstr(buf, needle) ? 0 : -1;
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h 
b/tools/testing/selftests/cgroup/cgroup_util.h
index 27ff21d82af1..49c54fbdb229 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -30,6 +30,7 @@ extern int cg_read_strstr(const char *cgroup, const char 
*control,
  const char *needle);
 extern long cg_read_long(const char *cgroup, const char *control);
 long cg_read_key_long(const char *cgroup, const char *control, const char 
*key);
+extern long cg_read_lc(const char *cgroup, const char *control);
 extern int cg_write(const char *cgroup, const char *control, char *buf);
 extern int cg_run(const char *cgroup,
  int (*fn)(const char *cgroup, void *arg),
@@ -48,3 +49,4 @@ extern int set_oom_adj_score(int pid, int score);
 extern int cg_wait_for_proc_count(const char *cgroup, int count);
 extern int cg_killall(const char *cgroup);
 extern ssize_t proc_read_text(int pid, bool thread, const char *item, char 
*buf, size_t size);
+extern int proc_read_strstr(int pid, bool thread, const char *item, const char 
*needle);
diff --git a/tools/testing/selftests/cgroup/test_core.c 
b/tools/testing/selftests/cgroup/test_core.c
index 79053a4f4783..c5ca669feb2b 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -5,6 +5,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -354,6 +357,147 @@ static int test_cgcore_internal_process_constraint(const 
char *root)
return ret;
 }
 
+static void *dummy_thread_fn(void *arg)
+{
+   return (void *)(size_t)pause();
+}
+
+/*
+ * Test threadgroup migration.
+ * All threads of a process are migrated together.
+ */
+static int test_cgcore_proc_migration(const char *root)
+{
+   int ret = KSFT_FAIL;
+   int t, c_threads, n_threads = 13;
+   char *src = NULL, *dst = NULL;
+   pthread_t threads[n_threads];
+
+   src = cg_name(root, "cg_src");
+   dst = cg_name(root, "cg_dst");
+   if (!src || !dst)
+   goto cleanup;
+
+   if (cg_create(src))
+   goto cleanup;
+   if (cg_create(dst))
+   goto cleanup;
+
+   if (cg_enter_current(src))
+   goto cleanup;
+
+   for (c_threads = 0; c_threads < n_threads; ++c_threads) {
+   if (pthread_create(&threads[c_threads], NULL, dummy_thread_fn, NULL))
+   goto cleanup;
+   }
+
+   cg_enter_current(dst);
+   if (cg_read_lc(dst, "cgroup.threads") != n_threads + 1)
+   goto cleanup;
+
+   ret = KSFT_PASS;
+
+cleanup:
+   for (t = 0; t < c_threads; ++t) {
+   pthread_cancel(threads[t]);
+   }
+
+   for (t = 0; t < c_threads; ++t) {
+   pthread_join(threads[

[PATCH 5/5] selftests: cgroup: Run test_core under interfering stress

2019-10-04 Thread Michal Koutný
test_core tests various cgroup creation/removal and task migration
paths. Run the tests repeatedly with interfering noise (for lockdep
checks). Currently, forking noise and subsystem enabled/disabled
switching are the implemented noises.

Signed-off-by: Michal Koutný 
---
 tools/testing/selftests/cgroup/Makefile   |   2 +
 tools/testing/selftests/cgroup/test_stress.sh |   4 +
 tools/testing/selftests/cgroup/with_stress.sh | 101 ++
 3 files changed, 107 insertions(+)
 create mode 100755 tools/testing/selftests/cgroup/test_stress.sh
 create mode 100755 tools/testing/selftests/cgroup/with_stress.sh

diff --git a/tools/testing/selftests/cgroup/Makefile 
b/tools/testing/selftests/cgroup/Makefile
index 1c9179400be0..66aafe1f5746 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -3,6 +3,8 @@ CFLAGS += -Wall -pthread
 
 all:
 
+TEST_FILES := with_stress.sh
+TEST_PROGS := test_stress.sh
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
diff --git a/tools/testing/selftests/cgroup/test_stress.sh 
b/tools/testing/selftests/cgroup/test_stress.sh
new file mode 100755
index ..15d9d5896394
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_stress.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+./with_stress.sh -s subsys -s fork ./test_core
diff --git a/tools/testing/selftests/cgroup/with_stress.sh 
b/tools/testing/selftests/cgroup/with_stress.sh
new file mode 100755
index ..e28c35008f5b
--- /dev/null
+++ b/tools/testing/selftests/cgroup/with_stress.sh
@@ -0,0 +1,101 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+stress_fork()
+{
+   while true ; do
+   /usr/bin/true
+   sleep 0.01
+   done
+}
+
+stress_subsys()
+{
+   local verb=+
+   while true ; do
+   echo $verb$subsys_ctrl >$sysfs/cgroup.subtree_control
+   [ $verb = "+" ] && verb=- || verb=+
+   # incommensurable period with other stresses
+   sleep 0.011
+   done
+}
+
+init_and_check()
+{
+   sysfs=`mount -t cgroup2 | head -1 | awk '{ print $3 }'`
+   if [ ! -d "$sysfs" ]; then
+   echo "Skipping: cgroup2 is not mounted" >&2
+   exit $ksft_skip
+   fi
+
+   if ! echo +$subsys_ctrl >$sysfs/cgroup.subtree_control ; then
+   echo "Skipping: cannot enable $subsys_ctrl in $sysfs" >&2
+   exit $ksft_skip
+   fi
+
+   if ! echo -$subsys_ctrl >$sysfs/cgroup.subtree_control ; then
+   echo "Skipping: cannot disable $subsys_ctrl in $sysfs" >&2
+   exit $ksft_skip
+   fi
+}
+
+declare -a stresses
+declare -a stress_pids
+duration=5
+rc=0
+subsys_ctrl=cpuset
+sysfs=
+
+while getopts c:d:hs: opt; do
+   case $opt in
+   c)
+   subsys_ctrl=$OPTARG
+   ;;
+   d)
+   duration=$OPTARG
+   ;;
+   h)
+   echo "Usage $0 [ -s stress ] ... [ -d duration ] [-c 
controller] cmd args .."
+   echo -e "\t default duration $duration seconds"
+   echo -e "\t default controller $subsys_ctrl"
+   exit
+   ;;
+   s)
+   func=stress_$OPTARG
+   if [ "x$(type -t $func)" != "xfunction" ] ; then
+   echo "Unknown stress $OPTARG"
+   exit 1
+   fi
+   stresses+=($func)
+   ;;
+   esac
+done
+shift $((OPTIND - 1))
+
+init_and_check
+
+for s in ${stresses[*]} ; do
+   $s &
+   stress_pids+=($!)
+done
+
+
+time=0
+start=$(date +%s)
+
+while [ $time -lt $duration ] ; do
+   $*
+   rc=$?
+   [ $rc -eq 0 ] || break
+   time=$(($(date +%s) - $start))
+done
+
+for pid in ${stress_pids[*]} ; do
+   kill -SIGTERM $pid
+   wait $pid
+done
+
+exit $rc
-- 
2.21.0



[PATCH 1/5] cgroup: Update comments about task exit path

2019-10-04 Thread Michal Koutný
We no longer take cgroup_mutex in cgroup_exit and the exiting tasks are
not moved to init_css_set, reflect that in several comments to prevent
confusion.

Signed-off-by: Michal Koutný 
---
 kernel/cgroup/cgroup.c | 29 +
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 753afbca549f..1488bb732902 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -899,8 +899,7 @@ static void css_set_move_task(struct task_struct *task,
/*
 * We are synchronized through cgroup_threadgroup_rwsem
 * against PF_EXITING setting such that we can't race
-* against cgroup_exit() changing the css_set to
-* init_css_set and dropping the old one.
+* against cgroup_exit()/cgroup_free() dropping the css_set.
 */
WARN_ON_ONCE(task->flags & PF_EXITING);
 
@@ -1430,9 +1429,8 @@ struct cgroup *task_cgroup_from_root(struct task_struct 
*task,
 struct cgroup_root *root)
 {
/*
-* No need to lock the task - since we hold cgroup_mutex the
-* task can't change groups, so the only thing that can happen
-* is that it exits and its css is set back to init_css_set.
+* No need to lock the task - since we hold css_set_lock the
+* task can't change groups.
 */
return cset_cgroup_from_root(task_css_set(task), root);
 }
@@ -6020,7 +6018,7 @@ void cgroup_post_fork(struct task_struct *child)
struct css_set *cset;
 
	spin_lock_irq(&css_set_lock);
-   cset = task_css_set(current);
+   cset = task_css_set(current); /* current is @child's parent */
	if (list_empty(&child->cg_list)) {
get_css_set(cset);
cset->nr_tasks++;
@@ -6063,20 +6061,8 @@ void cgroup_post_fork(struct task_struct *child)
  * cgroup_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
- * Description: Detach cgroup from @tsk and release it.
- *
- * Note that cgroups marked notify_on_release force every task in
- * them to take the global cgroup_mutex mutex when exiting.
- * This could impact scaling on very large systems.  Be reluctant to
- * use notify_on_release cgroups where very high task exit scaling
- * is required on large systems.
+ * Description: Detach cgroup from @tsk.
  *
- * We set the exiting tasks cgroup to the root cgroup (top_cgroup).  We
- * call cgroup_exit() while the task is still competent to handle
- * notify_on_release(), then leave the task attached to the root cgroup in
- * each hierarchy for the remainder of its exit.  No need to bother with
- * init_css_set refcnting.  init_css_set never goes away and we can't race
- * with migration path - PF_EXITING is visible to migration path.
  */
 void cgroup_exit(struct task_struct *tsk)
 {
@@ -6086,7 +6072,8 @@ void cgroup_exit(struct task_struct *tsk)
 
/*
 * Unlink from @tsk from its css_set.  As migration path can't race
-* with us, we can check css_set and cg_list without synchronization.
+* with us (thanks to cgroup_threadgroup_rwsem), we can check css_set
+* and cg_list without synchronization.
 */
cset = task_css_set(tsk);
 
@@ -6102,6 +6089,8 @@ void cgroup_exit(struct task_struct *tsk)
 
	spin_unlock_irq(&css_set_lock);
} else {
+   /* Take reference to avoid freeing init_css_set in cgroup_free,
+* see cgroup_fork(). */
get_css_set(cset);
}
 
-- 
2.21.0



[PATCH 0/5] Optimize single thread migration

2019-10-04 Thread Michal Koutný
Hello.

The important part is patch 02, where the reasoning is.

The rest is mostly auxiliary and split out into separate commits for
better readability.

The patches are based on v5.3. 

Michal


Michal Koutný (5):
  cgroup: Update comments about task exit path
  cgroup: Optimize single thread migration
  selftests: cgroup: Simplify task self migration
  selftests: cgroup: Add task migration tests
  selftests: cgroup: Run test_core under interfering stress

 kernel/cgroup/cgroup-internal.h   |   5 +-
 kernel/cgroup/cgroup-v1.c |   5 +-
 kernel/cgroup/cgroup.c|  68 
 tools/testing/selftests/cgroup/Makefile   |   4 +-
 tools/testing/selftests/cgroup/cgroup_util.c  |  42 -
 tools/testing/selftests/cgroup/cgroup_util.h  |   6 +-
 tools/testing/selftests/cgroup/test_core.c| 146 ++
 tools/testing/selftests/cgroup/test_freezer.c |   2 +-
 tools/testing/selftests/cgroup/test_stress.sh |   4 +
 tools/testing/selftests/cgroup/with_stress.sh | 101 
 10 files changed, 342 insertions(+), 41 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_stress.sh
 create mode 100755 tools/testing/selftests/cgroup/with_stress.sh

-- 
2.21.0



[PATCH 2/5] cgroup: Optimize single thread migration

2019-10-04 Thread Michal Koutný
There are reports of users who use thread migrations between cgroups and
report a performance drop after d59cfc09c32a ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with many CPUs.

The migration is affected by forking noise happening in the background,
after the mentioned commit a migrating thread must wait for all
(forking) processes on the system, not only of its threadgroup.

There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).

In the case of a self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in the case of single-thread
migration, and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In the case of do_exit() due to signal
delivery, we either exit before the migration or finish the migration
(of a not yet PF_EXITING thread) and die afterwards.

This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.
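
For illustration, a minimal userspace sketch of such a self-migration;
the destination path is made up and its subtree is assumed to be in
threaded mode:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int migrate_self(const char *dst_cgroup)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "%s/cgroup.threads", dst_cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* "0" denotes the writing thread itself -- the relaxed path above */
	if (write(fd, "0", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	return migrate_self("/sys/fs/cgroup/dst") ? 1 : 0;
}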

This change improves the performance of migration-dependent workloads
similarly to the former per-signal_struct state.

Signed-off-by: Michal Koutný 
---
 kernel/cgroup/cgroup-internal.h |  5 +++--
 kernel/cgroup/cgroup-v1.c   |  5 +++--
 kernel/cgroup/cgroup.c  | 39 +
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 809e34a3c017..90d1710fef6c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -231,9 +231,10 @@ int cgroup_migrate(struct task_struct *leader, bool 
threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
   bool threadgroup);
-struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
+bool *locked)
	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task)
+void cgroup_procs_write_finish(struct task_struct *task, bool locked)
	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 88006be40ea3..dd0b263430d5 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -514,12 +514,13 @@ static ssize_t __cgroup1_procs_write(struct 
kernfs_open_file *of,
struct task_struct *task;
const struct cred *cred, *tcred;
ssize_t ret;
+   bool locked;
 
cgrp = cgroup_kn_lock_live(of->kn, false);
if (!cgrp)
return -ENODEV;
 
-   task = cgroup_procs_write_start(buf, threadgroup);
+	task = cgroup_procs_write_start(buf, threadgroup, &locked);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
@@ -541,7 +542,7 @@ static ssize_t __cgroup1_procs_write(struct 
kernfs_open_file *of,
ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-   cgroup_procs_write_finish(task);
+   cgroup_procs_write_finish(task, locked);
 out_unlock:
cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1488bb732902..3f806dbe279b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2822,7 +2822,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct 
task_struct *leader,
return ret;
 }
 
-struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
+bool *locked)
	__acquires(&cgroup_threadgroup_rwsem)
 {
struct task_struct *tsk;
@@ -2831,7 +2832,21 @@ struct task_struct *cgroup_procs_write_start(char *buf, 
bool threadgroup)
	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
return ERR_PTR(-EINVAL);
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
+   /*
+* If we migrate a single thread, we don't care about threadgroup
+* stability. If the thread is `current`, it won't exit(2) under our
+* hands or change PID through exec(2). We exclude
+* cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+* callers by cgroup_mutex.
+* Therefore, we can skip the global lock.
+*/
+	lockdep_assert_held(&cgroup_mutex);
+   if (pid || threadgroup) {
+		percpu_down_write(&cgroup_threadgroup_rwsem);
+   *locked = true;
+   } else {
+   *locked = false;
+

Re: [PATCH 08/10] blkcg: implement blk-iocost

2019-10-03 Thread Michal Koutný
Hi (and apologies for the relatively late reply).

On Tue, Sep 10, 2019 at 09:08:55AM -0700, Tejun Heo  wrote:
> I can implement the switching if so.
I see the "conflict" is solved by the switching.

> Initially, I put them under block device sysfs but it was too clumsy
> with different config file formats and all.
Do you have any more details on that? In the end, it all boils down to a
daemon/setup utility writing into the control files, and it can use
whatever config file format it chooses, can't it?
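
For illustration, a sketch of what such a utility boils down to; the
control file name ("io.cost.qos") and the per-device line format are
placeholders here, not necessarily the final interface:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_dev_config(const char *ctrl_file, const char *line)
{
	ssize_t len = (ssize_t)strlen(line);
	int fd = open(ctrl_file, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, line, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* "8:0" stands for the target block device's MAJ:MIN */
	return write_dev_config("/sys/fs/cgroup/io.cost.qos",
				"8:0 enable=1\n") ? 1 : 0;
}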

> I think it's better to have global controller configs at the root
> cgroup.
I agree with the "global controller" configs, however, does it also hold
for "global controller per-device" configs? They seem closer to the
device than the controller. Potentially, the parameters could be used by
some other consumers in the future. (I'm not opposing the current form,
I just want to explore the space before an API is fixed.)


> Not at all.  These are system-wide configs.  cgroup namespaces
> shouldn't have anything which aren't in non-root cgroups.
Thanks, I understand the cgroup namespaces are not meant to be
transparent to their members.

Michal




Re: [PATCH RFC 00/14] The new slab memory controller

2019-10-03 Thread Michal Koutný
On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal 
 wrote:
> kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> I am however not aware of any automation based on it.
My experience is the same. However, the point is that this has been
exposed for ages, so the safe assumption is that there may be users.

> Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> option to enable it?
I don't think a v2 file is necessary given the cost of obtaining the
information. But I concur with the idea of making the per-object tracking
switchable at boot time or at least CONFIGurable.

Michal




Re: [PATCH RFC 00/14] The new slab memory controller

2019-10-01 Thread Michal Koutný
On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin  wrote:
> Roman Gushchin (14):
> [...]
>   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
From that commit's message:

> 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
>   no per-memcg kmem_caches anymore (empty output is printed)

The empty file means no allocations took place in the particular cgroup.
I find this quite a surprising change for consumers of these stats.

I understand obtaining the same data efficiently from the proposed
structures is difficult; however, such a change should be avoided. (In
my understanding, an obsoleted file means not available in v2; however,
it should not disappear from v1.)

Michal




Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-30 Thread Michal Koutný
Hi.

On Thu, Sep 26, 2019 at 05:55:29PM -0700, Mina Almasry  
wrote:
> My guess is that a new controller needs to support cgroups-v2, which
> is fine. But can a new controller also support v1? Or is there a
> requirement that new controllers support *only* v2? I need whatever
> solution here to work on v1.
Here is my view of important criteria:

1) stability, v1 APIs and semantics should not be changed,
2) futureproofness, v1 uses should be convertible to v2 uses,
3) maintainability, the less (similar) code the better.

And here is my braindump of some approaches:

A) new controller, v2 only
- 1) ok
- 2) may be ok
- 3) separate v1 and v2 implementations
- exclusion must be ensured on hybrid hierarchies

B) new controller, version oblivious (see e.g. pid)
- 1) sort of ok
- 2) partially ok
- 3) two v1 implementations
- exclusion must be ensured even on pure v1 hierarchies

C) extending the existing controller, w/out v2 counterpart
- 1) ok with workarounds (new option switching behavior)
- 2) not ok
- 3) likely OK

D) extending the existing controller, with v2 counterpart
- 1) ok with workarounds (new option switching behavior, see cpuset)
- 2) may be ok
- 3) likely OK

AFAIU, the current patchset is a variation of C). Personally, I think
something like D) could work; I'm not convinced about A) and B) based on
the breakdown above. But it may induce some other ideas.


My two cents,
Michal




Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-09-03 Thread Michal Koutný
On Mon, Sep 02, 2019 at 04:02:57PM -0700, Suren Baghdasaryan 
 wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > +   enum uclamp_id clamp_id)
> > [...]
> > +   rcu_read_lock();
> > +   tg = css_tg(seq_css(sf));
> > +   util_clamp = tg->uclamp_req[clamp_id].value;
> > +   rcu_read_unlock();
> > +
> > +   if (util_clamp == SCHED_CAPACITY_SCALE) {
> > +   seq_puts(sf, "max\n");
> > +   return;
> > +   }
> > +
> > +   percent = tg->uclamp_pct[clamp_id];
> 
> You are taking RCU lock when accessing tg->uclamp_req but not when
> accessing tg->uclamp_pct.
Good point.

> Is that intentional? Can tg be destroyed under you?
Actually, the rcu_read{,un}lock should be unnecessary in the context of
the kernfs file op handler -- the tg/css won't go away as long as its
kernfs file is being worked with.





Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Michal Koutný
(+CC cgro...@vger.kernel.org)

On Thu, Aug 08, 2019 at 12:40:02PM -0700, Mina Almasry  
wrote:
> We have developers interested in using hugetlb_cgroups, and they have 
> expressed
> dissatisfaction regarding this behavior.
I assume you still want to enforce a limit on a particular group, and the
application must be able to handle resource scarcity (but be notified in
a nicer way than via SIGBUS).

> Alternatives considered:
> [...]
(I did not try that but) have you considered:
3) MAP_POPULATE while you're making the reservation (a sketch below),
4) Using multiple hugetlbfs mounts with respective limits.
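
To make 3) concrete, a sketch (the hugetlbfs mount point is made up);
populating at reservation time turns a shortage into an mmap() failure
instead of a later SIGBUS:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;	/* one 2 MiB huge page */
	int fd = open("/mnt/hugetlbfs/area", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd < 0)
		return 1;
	if (ftruncate(fd, len) < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");	/* limit hit or no huge pages available */
		return 1;
	}
	munmap(p, len);
	close(fd);
	return 0;
}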

> Caveats:
> 1. This support is implemented for cgroups-v1. I have not tried
>hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
>This is largely because we use cgroups-v1 for now.
Adding something new into v1 without a v2 counterpart makes migration
harder; that's one of the reasons why the v1 API is rather frozen now. (I'm
not sure whether the current hugetlb controller fits into v2 at all, though.)

Michal




Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-08 Thread Michal Koutný
On Thu, Aug 08, 2019 at 04:10:21PM +0100, Patrick Bellasi 
 wrote:
> Not sure to get what you mean here: I'm currently exposing uclamp to
> both v1 and v2 hierarchies.
The cpu controller has different APIs for the v1 and v2 hierarchies. My
question, reworded, is -- are the new knobs exposed in the legacy API
intentionally/for a reason?

Michal




Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps

2019-08-08 Thread Michal Koutný
On Thu, Aug 08, 2019 at 04:08:10PM +0100, Patrick Bellasi 
 wrote:
> Well, if I've got correctly your comment in the previous message, I
> would say that at this stage we don't need RCU looks at all.
Agreed.

> Reason being that cpu_util_update_eff() gets called only from
> cpu_uclamp_write() which is from an ongoing write operation on a cgroup
> attribute and thus granted to be available.
> 
> We will eventually need to move the RCU look only down the stack when
> uclamp_update_active_tasks() gets called to update the RUNNABLE tasks on
> a RQ... or perhaps we don't need them since we already get the
> task_rq_lock() for each task we visit.
Unless you remove css_for_each_descendant_pre() in
cpu_util_update_eff(), the rcu_read_lock() cannot go below it.
(You'd be RCU-accessing other csses that aren't pinned in the write.)
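
For illustration, the shape I mean (not the actual cpu_util_update_eff()
body, just the locking around the walk):

#include <linux/cgroup.h>
#include <linux/rcupdate.h>

static void clamp_update_sketch(struct cgroup_subsys_state *top_css)
{
	struct cgroup_subsys_state *css;

	rcu_read_lock();	/* pins the csses visited by the walk */
	css_for_each_descendant_pre(css, top_css) {
		/* update the effective clamps of css here */
	}
	rcu_read_unlock();
}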

Michal




Re: [PATCH v13 0/6] Add utilization clamping support (CGroups API)

2019-08-06 Thread Michal Koutný
On Fri, Aug 02, 2019 at 10:08:47AM +0100, Patrick Bellasi 
 wrote:
> Patrick Bellasi (6):
>   sched/core: uclamp: Extend CPU's cgroup controller
>   sched/core: uclamp: Propagate parent clamps
>   sched/core: uclamp: Propagate system defaults to root group
>   sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
>   sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>   sched/core: uclamp: always use enum uclamp_id for clamp_id values
Thank you Patrick for your patience. I used the time to revisit the
series once again and I think the RCU locks can be streamlined a bit. If
you find that correct, feel free to add my Reviewed-by to the updated
series (for 1/6 and legacy, I'm just asking).

Michal




Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps

2019-08-06 Thread Michal Koutný
On Fri, Aug 02, 2019 at 10:08:49AM +0100, Patrick Bellasi 
 wrote:
> @@ -7095,6 +7149,7 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file 
> *of, char *buf,
>   if (req.ret)
>   return req.ret;
>  
> +	mutex_lock(&uclamp_mutex);
>   rcu_read_lock();
>  
>   tg = css_tg(of_css(of));
> @@ -7107,7 +7162,11 @@ static ssize_t cpu_uclamp_write(struct 
> kernfs_open_file *of, char *buf,
>*/
>   tg->uclamp_pct[clamp_id] = req.percent;
>  
> + /* Update effective clamps to track the most restrictive value */
> + cpu_util_update_eff(of_css(of));
> +
>   rcu_read_unlock();
> +	mutex_unlock(&uclamp_mutex);
Following my remarks to "[PATCH v13 1/6] sched/core: uclamp: Extend
CPU's cgroup", I wonder if the rcu_read_lock() couldn't be moved right
before cpu_util_update_eff(). And by extension rcu_read_(un)lock could
be hidden into cpu_util_update_eff() closer to its actual need.





Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-06 Thread Michal Koutný
On Fri, Aug 02, 2019 at 10:08:48AM +0100, Patrick Bellasi 
 wrote:
> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off,
> + enum uclamp_id clamp_id)
> +{
> + struct uclamp_request req;
> + struct task_group *tg;
> +
> + req = capacity_from_percent(buf);
> + if (req.ret)
> + return req.ret;
> +
> + rcu_read_lock();
This should be the uclamp_mutex.

(The compound result of the series is correct, as the lock is introduced
in "sched/core: uclamp: Propagate parent clamps".
This is just for the happiness of cherry-pickers/bisectors.)

> +static inline void cpu_uclamp_print(struct seq_file *sf,
> + enum uclamp_id clamp_id)
> +{
> [...]
> + rcu_read_lock();
> + tg = css_tg(seq_css(sf));
> + util_clamp = tg->uclamp_req[clamp_id].value;
> + rcu_read_unlock();
Why is the rcu_read_lock() needed here? (I'm considering the comment in
of_css() that should apply here (and it seems that similar uses in other
seq_file handlers also skip this).)

> @@ -7369,6 +7506,20 @@ static struct cftype cpu_legacy_files[] = {
> [...]
> + .name = "uclamp.min",
> [...]
> + .name = "uclamp.max",
I don't see technical reasons why uclamp couldn't work on the legacy
hierarchy, and Tejun acked the series; despite that, I'll ask -- should
the new attributes be exposed in the v1 controller hierarchy (taking into
account that the legacy API is sort of frozen and the potential maintenance
needs spanning both hierarchies)?



Re: [PATCH v12 3/6] sched/core: uclamp: Propagate system defaults to root group

2019-07-25 Thread Michal Koutný
On Thu, Jul 18, 2019 at 07:17:45PM +0100, Patrick Bellasi 
 wrote:
> The clamp values are not tunable at the level of the root task group.
> That's for two main reasons:
> 
>  - the root group represents "system resources" which are always
>entirely available from the cgroup standpoint.
> 
>  - when tuning/restricting "system resources" makes sense, tuning must
>be done using a system wide API which should also be available when
>control groups are not.
> 
> When a system wide restriction is available, cgroups should be aware of
> its value in order to know exactly how much "system resources" are
> available for the subgroups.
IIUC, the global default would apply in uclamp_eff_get(), so this
propagation isn't strictly necessary in order to apply to tasks (that's
how it works under !CONFIG_UCLAMP_TASK_GROUP).
The reason is that the effective value (which isn't currently exposed) in
a group takes this global restriction into account, right?


> @@ -1043,12 +1063,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
> [...]
> + if (update_root_tg)
> + uclamp_update_root_tg();
> +
>   /*
>* Updating all the RUNNABLE task is expensive, keep it simple and do
>* just a lazy update at each next enqueue time.
Since uclamp_update_root_tg() traverses down to
uclamp_update_active_tasks(), is this comment only half true now?





Re: [PATCH v12 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-07-25 Thread Michal Koutný
On Thu, Jul 18, 2019 at 07:17:43PM +0100, Patrick Bellasi 
 wrote:
> +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> [...]
> +static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> [...]
These two functions are almost identical yet not trivial. I think it
would be better to have the code in one place only and distinguish by
the passed clamp_id.




Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-19 Thread Michal Koutný
On Tue, Jul 16, 2019 at 10:41:36AM +0800, 王贇   
wrote:
> Actually whatever the memory node sets or cpu allow sets is, it will
> take effect on task's behavior regarding memory location and cpu
> location, while the locality only care about the results rather than
> the sets.
My previous response missed much of the context, so it was a bit off.

I see what you mean by the locality now. Alas, I can't assess whether
it's the right thing to do regarding the NUMA behavior that you are trying
to optimize (i.e. you need an answer from someone more familiar with NUMA
balancing).

Michal


Re: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat

2019-07-19 Thread Michal Koutný
On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇   
wrote:
> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see new
> output line heading with 'exectime', like:
> 
>   exectime 311900 407166
What you present are times aggregated over the CPUs in the NUMA nodes;
this seems like a rather lossy interface.

Even if the aggregated information is sufficient for your monitoring,
I think it's worth providing the information with its original
granularity.

Note that the cpuacct v1 controller used to report such per-CPU runtime
stats. A v2 implementation would rather build upon the rstat API.

Michal



Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-07-16 Thread Michal Koutný
On Tue, Jul 16, 2019 at 03:34:35PM +0100, Patrick Bellasi 
 wrote:
> Am I missing something?
No, it's rather my misinterpretation of the syscall semantics.

> Otherwise, I think the changelog sentence you quoted is just
> misleading.
It certainly misled me into thinking about the sched_setattr() calls as
requests for the utilization to be in the given interval (substituting 0 or 1
when only one boundary is given, and further constrained by the tg's interval).

I see your point, those are actually two (mostly) independent controls.
Makes sense now.

Thanks,
Michal


Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group

2019-07-16 Thread Michal Koutný
On Tue, Jul 16, 2019 at 03:34:17PM +0100, Patrick Bellasi 
 wrote:
> > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > this should be protected with rcu_read_lock().
> 
> Right, good catch! Will add in v12.
When I responded to your other patch, it occurred to me that since
cpu_util_update_eff() writes down to child csses, it should also take
uclamp_mutex here to avoid a race with direct cgroup attribute
writers.

