Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
On Tue, Apr 09, 2024 at 11:01:26AM GMT, Steven Rostedt wrote:
> > -	tpid = pid & (PID_MAX_DEFAULT - 1);
> > +	tpid = pid % PID_MAP_SIZE;
>
> Does that compile to the same? This is a fast path.

I didn't check. If fast is the intention, I would change it to something
like BUILD_BUG_ON(!is_power_of_2(PID_MAP_SIZE)) and keep the bit
operation without reliance on compiler optimizations.

Thanks for the response (I may not follow up on this single commit
though).

Michal

signature.asc
Description: PGP signature
Re: [PATCH 2/3] kernel/pid: Remove default pid_max value
On Mon, Apr 08, 2024 at 04:58:18PM GMT, Michal Koutný wrote:
> The kernel provides mechanisms, while it should not imply policies --
> default pid_max seems to be an example of the policy that does not fit
> all. At the same time pid_max must have some value assigned, so use the
> end of the allowed range -- pid_max_max.
>
> This change thus increases initial pid_max from 32k to 4M (x86_64
> defconfig).

Out of curiosity I dug out the commit acdc721fe26d ("[PATCH]
pid-max-2.5.33-A0") v2.5.34~5 that introduced the 32k default. The
commit message doesn't say why such a sudden change though. Previously,
the limit was 1G of pids (i.e. effectively no default limit, like the
intention of this series).

Honestly, I expected more enthusiasm or arguments against removing the
default value of pid_max. Is this really not of interest to anyone?

(Thanks, Andrew, for your responses. I don't plan to pursue this further
should there be no more interest in having fewer default limit values in
the kernel.)

Regards,
Michal
Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value
On Thu, Apr 11, 2024 at 03:03:31PM -0700, Andrew Morton wrote:
> A large increase in the maximum number of processes.

The change from (some) default to effective infinity is the crux of the
change, because the default is only a number. (Thus I don't find the
number's 12700% increase alone a big change.) The actual maximum number
of processes is "workload dependent" and hence should be determined
based on the particular workload.

> Or did I misinterpret?

I thought you saw an issue with the projection of that number into
sizings based on the default. Which of the two comprises the large
change in your eyes?

Thanks,
Michal
Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value
Hello.

On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton wrote:
> That seems like a large change.

In what sense is it large? I tried to look up the code parts that depend
on this default and either added the other patches or mentioned the
impact (that part could be more thorough) in the commit message.

> It isn't clear why we'd want to merge this patchset. Does it improve
> anyone's life and if so, how?

- kernel devs who don't care about policy
  - policy should be decided by distros/users, not in kernel
- users who need many threads
  - current default is too low
  - this is one more place to look at when configuring
- users who want to prevent fork-bombs
  - current default is ineffective (too high), false feeling of safety
  - i.e. they should configure the appropriate mechanism appropriately

I thought that the first point alone would be convincing and that only
the scaling impact might need clarification.

Regards,
Michal
Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote:
> It's always non-NULL based on testing.
>
> It's hard for me to say definitely by reading the code. But IIUC
> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and config limits, etc., for the diasabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by KConfig, so
> get_current_misc_cg() thus sgx_get_current_cg() should not return NULL
> regardless 'cgroup_disable=misc'.
>
> Maybe @Michal or @tj can confirm?

The current implementation creates the root css object (see
cgroup_init(); the cgroup_ssid_enabled() check comes after
cgroup_init_subsys()). I.e. it will look like all tasks are members of
the root cgroup wrt the given controller permanently, and the
controller's attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement, e.g. see the uses of
mem_cgroup_disabled(). Not sure if this is useful for the misc
controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to interpret it.

HTH,
Michal
[PATCH 2/3] kernel/pid: Remove default pid_max value
pid_max is a per-pidns (thus also global) limit on the number of tasks
the kernel admits. The knob can be configured by an admin in the range
between pid_max_min and pid_max_max (sic). The default value sits
between those and typically equals max(32k, 1k*nr_cpus). The nr_cpus
scaling was introduced in commit 72680a191b93 ("pids: increase pid_max
based on num_possible_cpus") to accommodate the kernel's own helper
tasks (before workqueues). Generally, a 1024 tasks/cpu cap is too much
if they were all running and also too little when they are idle (memory
being the bottleneck).

The kernel also provides other mechanisms to restrict the number of
tasks -- the threads-max sysctl and RLIMIT_NPROC with memory-scaled
defaults, and the generic pids cgroup controller (the last one being the
solution for fork-bombs, with qualified limits set up by an admin).

The kernel provides mechanisms, while it should not imply policies --
the default pid_max seems to be an example of a policy that does not fit
all. At the same time pid_max must have some value assigned, so use the
end of the allowed range -- pid_max_max. This change thus increases the
initial pid_max from 32k to 4M (x86_64 defconfig). This affects the size
of the structure that alloc_pid()/idr_alloc_cyclic() eventually uses and
of the structure that kernel tracing uses with 'record-tgid' (~16 MiB).

Signed-off-by: Michal Koutný
---
 include/linux/pid.h     | 4 ++--
 include/linux/threads.h | 15 ---
 kernel/pid.c            | 8 +++-
 3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index a3aad9b4074c..0d191ac02958 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -106,8 +106,8 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
			 enum pid_type);
 
-extern int pid_max;
-extern int pid_max_min, pid_max_max;
+extern int pid_max_min, pid_max;
+extern const int pid_max_max;
 
 /*
  * look up a PID in the hash table. Must be called with the tasklist_lock
diff --git a/include/linux/threads.h b/include/linux/threads.h
index c34173e6c5f1..43f8f38a0c13 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -22,25 +22,18 @@
 
 #define MIN_THREADS_LEFT_FOR_ROOT 4
 
-/*
- * This controls the default maximum pid allocated to a process
- */
-#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
-
 /*
  * A maximum of 4 million PIDs should be enough for a while.
  * [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
  */
 #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
-	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
+	(sizeof(long) > 4 ? 4 * 1024 * 1024 : 0x8000))
 
 /*
- * Define a minimum number of pids per cpu. Heuristically based
- * on original pid max of 32k for 32 cpus. Also, increase the
- * minimum settable value for pid_max on the running system based
- * on similar defaults. See kernel/pid.c:pid_idr_init() for details.
+ * Define a minimum number of pids per cpu. Mainly to accommodate
+ * smpboot_register_percpu_thread() kernel threads.
+ * See kernel/pid.c:pid_idr_init() for details.
  */
-#define PIDS_PER_CPU_DEFAULT	1024
 #define PIDS_PER_CPU_MIN	8
 
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index da76ed1873f7..24ae505ac3b0 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -60,10 +60,10 @@ struct pid init_struct_pid = {
	}, }
 };
 
-int pid_max = PID_MAX_DEFAULT;
+int pid_max = PID_MAX_LIMIT;
 int pid_max_min = RESERVED_PIDS + 1;
-int pid_max_max = PID_MAX_LIMIT;
+const int pid_max_max = PID_MAX_LIMIT;
 
 /*
  * Pseudo filesystems start inode numbering after one. We use Reserved
  * PIDs as a natural offset.
@@ -652,9 +652,7 @@ void __init pid_idr_init(void)
	/* Verify no one has done anything silly: */
	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
 
-	/* bump default and minimum pid_max based on number of cpus */
-	pid_max = min(pid_max_max, max_t(int, pid_max,
-				PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
+	/* bump minimum pid_max based on number of cpus */
	pid_max_min = max_t(int, pid_max_min,
				PIDS_PER_CPU_MIN * num_possible_cpus());
	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
-- 
2.44.0
[PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
Index calculations into map_pid_to_cmdline use PID_MAX_DEFAULT but they
actually depend on the size of map_pid_to_cmdline. The size of the map
may be arbitrary. First, refer to the map size where necessary; second,
pick a good value for the size of the map.

Since the buffer is allocated at boot (i.e. the user cannot affect its
size later), accounting for the full PID_MAX_LIMIT would inflate the
map's size unnecessarily (4*4M = 16 MiB) for all users. Stick to the
original value of 4*32k; the commit 785e3c0a3a87 ("tracing: Map all PIDs
to command lines") explains why it still works for higher pids.

The point of this exercise is to remove the dependency on
PID_MAX_DEFAULT.

Signed-off-by: Michal Koutný
---
 kernel/trace/trace_sched_switch.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 8a407adb0e1c..aca2dafdd97a 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -161,6 +161,7 @@ static size_t tgid_map_max;
 
 #define SAVED_CMDLINES_DEFAULT 128
 #define NO_CMDLINE_MAP UINT_MAX
+#define PID_MAP_SIZE (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
 /*
  * Preemption must be disabled before acquiring trace_cmdline_lock.
  * The various trace_arrays' max_lock must be acquired in a context
@@ -168,7 +169,7 @@ static size_t tgid_map_max;
  */
 static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 
 struct saved_cmdlines_buffer {
-	unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
+	unsigned map_pid_to_cmdline[PID_MAP_SIZE];
	unsigned *map_cmdline_to_pid;
	unsigned cmdline_num;
	int cmdline_idx;
@@ -248,7 +249,7 @@ int trace_save_cmdline(struct task_struct *tsk)
	if (!tsk->pid)
		return 1;
 
-	tpid = tsk->pid & (PID_MAX_DEFAULT - 1);
+	tpid = tsk->pid % PID_MAP_SIZE;
 
	/*
	 * It's not the end of the world if we don't get
@@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[])
		return;
	}
 
-	tpid = pid & (PID_MAX_DEFAULT - 1);
+	tpid = pid % PID_MAP_SIZE;
	map = savedcmd->map_pid_to_cmdline[tpid];
	if (map != NO_CMDLINE_MAP) {
		tpid = savedcmd->map_cmdline_to_pid[map];
@@ -645,8 +646,8 @@ tracing_saved_cmdlines_size_write(struct file *filp, const char __user *ubuf,
	if (ret)
		return ret;
 
-	/* must have at least 1 entry or less than PID_MAX_DEFAULT */
-	if (!val || val > PID_MAX_DEFAULT)
+	/* must have at least 1 entry or fit into map_pid_to_cmdline */
+	if (!val || val >= PID_MAP_SIZE)
		return -EINVAL;
 
	ret = tracing_resize_saved_cmdlines((unsigned int)val);
-- 
2.44.0
[PATCH 3/3] tracing: Compare pid_max against pid_list capacity
trace_pid_list_alloc() checks pid_max against a magic number referencing
an (obsolete) source file when it actually should check against the
capacity of the pid_list tree.

Turn the definition of MAX_PID around -- derive it from the tree
parameters and replace references to the magic value and header files
with the so-defined MAX_PID. Should PID_MAX_LIMIT change in the future
or pid_max escape PID_MAX_LIMIT, appropriate checks remain in place.

No functional change intended.

Signed-off-by: Michal Koutný
---
 kernel/trace/pid_list.c | 6 +++---
 kernel/trace/pid_list.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
index 95106d02b32d..b968f0b65dc1 100644
--- a/kernel/trace/pid_list.c
+++ b/kernel/trace/pid_list.c
@@ -93,7 +93,7 @@ static inline bool upper_empty(union upper_chunk *chunk)
 static inline int pid_split(unsigned int pid, unsigned int *upper1,
			    unsigned int *upper2, unsigned int *lower)
 {
-	/* MAX_PID should cover all pids */
+	/* MAX_PID must cover all possible pids */
	BUILD_BUG_ON(MAX_PID < PID_MAX_LIMIT);
 
	/* In case a bad pid is passed in, then fail */
@@ -413,8 +413,8 @@ struct trace_pid_list *trace_pid_list_alloc(void)
	struct trace_pid_list *pid_list;
	int i;
 
-	/* According to linux/thread.h, pids can be no bigger that 30 bits */
-	WARN_ON_ONCE(pid_max > (1 << 30));
+	/* See pid_split(), equal to pid_max > PID_MAX_LIMIT */
+	WARN_ON_ONCE(pid_max > MAX_PID);
 
	pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
	if (!pid_list)
diff --git a/kernel/trace/pid_list.h b/kernel/trace/pid_list.h
index 62e73f1ac85f..28562a9a3d01 100644
--- a/kernel/trace/pid_list.h
+++ b/kernel/trace/pid_list.h
@@ -56,8 +56,8 @@
 
 #define UPPER_MASK	(UPPER_MAX - 1)
 
-/* According to linux/thread.h pids can not be bigger than or equal to 1 << 30 */
-#define MAX_PID		(1 << 30)
+/* Structure can hold only pids strictly below this limit */
+#define MAX_PID		(1 << (UPPER_BITS + UPPER_BITS + LOWER_BITS))
 
 /* Just keep 6 chunks of both upper and lower in the cache on alloc */
 #define CHUNK_ALLOC 6
-- 
2.44.0
[PATCH 0/3] kernel/pid: Remove default pid_max value
TL;DR excerpt from commit 02/03:

    The kernel provides mechanisms, while it should not imply policies --
    default pid_max seems to be an example of the policy that does not fit
    all. At the same time pid_max must have some value assigned, so use the
    end of the allowed range -- pid_max_max.

More details are in that commit's message. The other two commits are
related preparation and a less related refresh of code that somewhat
references pid_max.

Michal Koutný (3):
  tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
  kernel/pid: Remove default pid_max value
  tracing: Compare pid_max against pid_list capacity

 include/linux/pid.h               | 4 ++--
 include/linux/threads.h           | 15 ---
 kernel/pid.c                      | 8 +++-
 kernel/trace/pid_list.c           | 6 +++---
 kernel/trace/pid_list.h           | 4 ++--
 kernel/trace/trace_sched_switch.c | 11 ++-
 6 files changed, 20 insertions(+), 28 deletions(-)

base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
-- 
2.44.0
Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang wrote:
> Do we really want to have it implemented in c?

I only pointed to the available C boilerplate.

> There are much fewer lines of
> code in shell scripts. Note we are not really testing basic cgroup stuff.
> All we needed were creating/deleting cgroups and set limits which I think
> have been demonstrated feasible in the ash scripts now.

I assume you refer to
Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
right?

Could it be even simpler if you didn't stick to cgtools APIs and v1
compatibility? Reducing ash_cgexec.sh to something like

	echo 0 >$R/$1/cgroup.procs
	shift
	exec "$@"

(with some small boilerplate for $R and previous mkdirs)

Thanks,
Michal
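The suggested reduction, fleshed out as a sketch (wrapped in a function for clarity; assumes cgroup2 mounted at $R, and relies on the convention that writing 0 to cgroup.procs moves the writing process itself):

```shell
#!/bin/sh
# minimal cgexec replacement sketch (assumes cgroup2 mounted at $R)
R="${R:-/sys/fs/cgroup}"

cgexec() {
	grp="$1"; shift
	echo 0 > "$R/$grp/cgroup.procs"	# 0 == move the current process
	exec "$@"			# replace the shell with the payload
}

# usage (commented out so sourcing this file has no side effects):
# cgexec test_grp ./my_sgx_selftest
```

Unlike cgroup-tools' cgexec, this needs nothing beyond a POSIX shell, which matches the kselftest preference for open-coded helpers voiced earlier in the thread.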
Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
Hello.

On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen wrote:
> > > It'd be more complicated and less readable to do all the stuff without
> > > the
> > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends
> > > on libc so I hope this would not cause too much inconvenience.
> >
> > As per cgroup-tools, please prove this. It makes the job for more
> > complicated *for you* and you are making the job more complicated
> > to every possible person in the planet running any kernel QA.
> >
> > I weight the latter more than the former. And it is exactly the
> > reason why we did custom user space kselftest in the first place.
> > Let's keep the tradition. All I can say is that kselftest is
> > unfinished in its current form.
> >
> > What is "esp cgexec"?
>
> Also in kselftest we don't drive ultimate simplicity, we drive
> efficient CI/QA. By open coding something like subset of
> cgroup-tools needed to run the test you also help us later
> on to backtrack the kernel changes. With cgroups-tools you
> would have to use strace to get the same info.

FWIW, see also the functions in
tools/testing/selftests/cgroup/cgroup_util.{h,c}. They likely cover what
you need already -- if the tests are in C.

(I admit that stuff in tools/testing/selftests/cgroup/ is best
understood with strace.)

HTH,
Michal
Re: Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
Hello.

On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC capacity,
> if one group has a fault, and its usage is not above its own limit
> (try_charge() passes), yet total usage of the system has exceeded the
> capacity, whether we do global reclaim or just reclaim pages in the current
> faulting group.
>
> > Also, what does the core mm memcg code do?
>
> I'm not sure. I'll try to find out but it'd be appreciated if someone more
> knowledgeable can comment on this. memcg also has the protection mechanism
> (i.e., min, low settings) to guarantee some allocation per group so its
> approach might not be applicable to misc controller here.

I only follow the discussion superficially but it'd be nice to have
analogous mechanisms in the memcg and sgx controllers.

The memory limits are rather simple -- when allocating new memory, the
tightest limit among ancestors applies and reclaim is applied to the
whole subtree of such an ancestor (not necessarily the originating
cgroup of the allocation). Overcommit is admitted; competition among
siblings is resolved on a first come, first served basis.

The memory protections are an additional (and in a sense orthogonal)
mechanism. They can be interpreted as limits that are enforced not at
the time of allocation but at the time of reclaim (and reclaim is
triggered independently, either globally or via the limits above). The
consequence is that the protection code must do some normalization to
evaluate overcommitted values sensibly.

(Historically, there was also memory.soft_limit_in_bytes, which combined
"protection" and global reclaim but it turned out broken. I suggest
reading Documentation/admin-guide/cgroup-v2.rst, section Controller
Issues and Remedies/Memory, for more details and as a cautionary
example.)

HTH,
Michal
Re: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang wrote:
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_res_ops sgx_epc_cgroup_ops = {
> +	.alloc = sgx_epc_cgroup_alloc,
> +	.free = sgx_epc_cgroup_free,
> +};
> +
> +static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
> +{
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +}

This is possibly a nitpick but I share it here for consideration. Would
it be more prudent to have the signature like

	alloc(struct misc_res *res, struct misc_cg *cg)

so that implementations are free of the assumption of how cg and res are
stored?

Thanks,
Michal
Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai" wrote:
> I am not sure, but is it possible or legal for an ancestor to have less limit
> than children?

Why not? It is desired for proper hierarchical delegation, and the
tightest limit of the ancestors applies (cf memory.max).

Michal
Re: Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path
On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh wrote:
> There is already a patch posted for similar problem -
> https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/

I was not aware, thanks for the pointer. Do you plan to update your
patch to "just" bail-out/zero instead of using slightly random values
(as pointed out by Bjorn)?

Michal
[RFC PATCH] rpmsg: glink: Add bounds check on tx path
Add bounds check on values read from shared memory in the tx path. In
cases where the VM is misbehaving, the transport should exit and print a
warning when bogus values may cause out-of-bounds reads.

Link: https://git.codelinaro.org/clo/la/kernel/msm-5.10/-/commit/32d9c3a2f2b6a4d1fc48d6871194f3faf3184e8b
Suggested-by: Chris Lew
Suggested-by: Sarannya S
Signed-off-by: Michal Koutný
---
 drivers/rpmsg/qcom_glink_smem.c | 9 +
 1 file changed, 9 insertions(+)

Why RFC?
The patch is adopted from the link above. It would be good to assess
whether such conditions can also happen with rpmsg glink. (And if so,
whether the zeroed values are the best correction.)

diff --git a/drivers/rpmsg/qcom_glink_smem.c b/drivers/rpmsg/qcom_glink_smem.c
index 7a982c60a8dd..3e786e590c03 100644
--- a/drivers/rpmsg/qcom_glink_smem.c
+++ b/drivers/rpmsg/qcom_glink_smem.c
@@ -146,6 +146,11 @@ static size_t glink_smem_tx_avail(struct qcom_glink_pipe *np)
	else
		avail -= FIFO_FULL_RESERVE + TX_BLOCKED_CMD_RESERVE;
 
+	if (avail > pipe->native.length) {
+		pr_warn_once("%s: avail clamped\n", __func__);
+		avail = 0;
+	}
+
	return avail;
 }
 
@@ -177,6 +182,10 @@ static void glink_smem_tx_write(struct qcom_glink_pipe *glink_pipe,
	unsigned int head;
 
	head = le32_to_cpu(*pipe->head);
+	if (head > pipe->native.length) {
+		pr_warn_once("%s: head overflow\n", __func__);
+		return;
+	}
 
	head = glink_smem_tx_write_one(pipe, head, hdr, hlen);
	head = glink_smem_tx_write_one(pipe, head, data, dlen);
-- 
2.43.0
Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events
On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.

I notice now that the callbacks are per resource and per cgroup. I think
that is overkill which complicates the misc_cg allocation code with the
parent's callbacks while freeing is done by own callbacks. It's possibly
too liberal to keep the objects' lifecycle understandable in the long
term too.

For some uniformity, I'd suggest the struct blkcg_policy array in
block/blk-cgroup.c as the precedent. (Perhaps indexed with the static
enum misc_res_type instead of dynamic index assignment as
blkcg_policy_register() does.)

I hope you don't really need per-cgroup callbacks :-)

Michal
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang wrote:
> Thanks for raising this. Actually my understanding the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
>
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
>
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.

That is the use of limits to prevent a runaway cgroup from smothering
the system. Overcommitted values in such a config are fine because the
more simultaneous runaways there are, the less likely they become. The
peak consumption comes at the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).

> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-commiting
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.

Another "config layout" of limits is to achieve partitioning
(sum == capacity). That is perfect isolation but it naturally goes
against efficient utilization.
The way other controllers approach this trade-off is with weights (cpu,
io) or protections (memory). I'm afraid the misc controller is not ready
for this. My opinion is to start with the simple limits (first patches)
and think of a prioritization/guarantee mechanism based on real use
cases.

HTH,
Michal
Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen wrote:
> 1. Admin sets a limit
> 2. Enclave is created
> 3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series into a part
implementing this simple limiting and a subsequent part with the reclaim
stuff, for easier digestibility.

> Nothing else matters.

If the latter part is an unnecessary overkill, it's even better.

Thanks,
Michal
Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types warrant also two types of misc resource? (To
deal with each in its own way.)

Thanks,
Michal
Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
On Fri, Sep 22, 2023 at 08:06:40PM -0700, Haitao Huang wrote:
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>
>  	cg = css_misc(of_css(of));
>
> -	if (READ_ONCE(misc_res_capacity[type]))
> +	if (READ_ONCE(misc_res_capacity[type])) {
>  		WRITE_ONCE(cg->res[type].max, max);
> -	else
> +		if (cg->res[type].max_write)
> +			cg->res[type].max_write(cg);
> +	} else {
>  		ret = -EINVAL;
>
> +	}

Is it time for a misc_cg_mutex? This gives no synchronization guarantees
to implementors of max_write. (Alternatively, document that the callback
must implement its own synchronization.)

Thanks,
Michal
Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
On Fri, Sep 22, 2023 at 08:06:55PM -0700, Haitao Huang wrote:
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);

It should check for !epc_cg since the misc controller implementation in
misc_cg_alloc() would roll back even on non-allocated resources.

> +	cancel_work_sync(&epc_cg->reclaim_work);
> +	kfree(epc_cg);
> +}
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +	/* Let the reclaimer to do the work so user is not blocked */
> +	queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);

This is weird. The writer will never learn about the result of the
operation.

Thanks,
Michal
Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang wrote:
> AFAIK, before we introducing max_write() callback in this series, no misc
> controller would possibly enforce the limit when misc.max is reduced. e.g. I
> don't think CVMs be killed when ASID limit is reduced and the cgroup was
> full before limit is reduced.

Yes, the misc controller was meant to be simple; current >= max serves
to prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages but it turned out to be too coarse (OOM-killing
processes if the amount was not sensed correctly) and this eventually
evolved into the specific mechanism of memory.reclaim. I'm mentioning
this in case that would be an interface with better semantics for your
use case (and misc.max writes could then remain non-preemptive).

One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then I found no such measure in the
code implementation. So I would suggest two terminological changes:
- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better suiting name
  than OOM (out of memory) condition,
- killing -- (unless you have an intention to implement process
  termination later) my current interpretation is that it is rather some
  aggressive unmapping within an address space, so a less confusing name
  for that would be "reclaim".

> I think EPC pages to VMs could have the same behavior, once they are given
> to a guest, never taken back by the host. For enclaves on host side, pages
> are reclaimable, that allows us to enforce in a similar way to memcg.

Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do users
have an option to lock certain pages in memory that yields this
difference?

Regards,
Michal
Re: [PATCH 0/9] sched: Core scheduling interfaces
On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra wrote:
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'
> cgroup properties.

The tasks file is a consequence of binding this to cgroups; I'm one step
back. Why make "core isolation" a cgroup property? (I understand this
could help "visualize" what the common domains are if cgroups were the
only API, but with prctl the structure can be arbitrarily modified
anyway.)

> Given something like:
>
>	  R
>	 / \
>	A   B
>	   / \
>	  C   D

Thanks for the example.

> B group can set core_sched=1 and then all its (and its decendants) tasks
> get to have the same (group) cookie and cannot share with others.

The same could be achieved with the first task of group B allocating its
new cookie, which would be inherited by its descendants.

> If however B is a delegate and has a subgroup D that is security
> sensitive and must not share core resources with the rest of B, then it
> can also set D.core_sched=1, such that D (and its decendants) will have
> another (group) cookie.

If there is such a sensitive descendant task, it could allocate a new
cookie (the same way as the first one in B did).

> On top of this, say C has a Real-Time tasks, that wants to limit SMT
> interference, then it can set a (task/prctl) cookie on itself, such that
> it will not share the core with the rest of the tasks of B.

(IIUC, in this particular example it'd be redundant if B had no inner
tasks since D isolated itself already.) Yes, so this is again the same
pattern as the tasks above have done.

> In that scenario the D subtree is a restriction (doesn't share) with the
> B subtree.

This implies D's isolation from everything else too, not just B's
members, no?
> And all of B is a restriction on all its tasks, including the Real-Time > task that set a task cookie, in that none of them can share with tasks > outside of B (including system tasks which are in R), irrespective of > what they do with their task cookie. IIUC, the equivalent restriction could be achieved with the PTRACE-like check in the prctl API too (with respectively divided uids). I'm curious whether the cgroup API actually simplifies things that are possible with the clone/prctl API or allows anything that wouldn't be otherwise possible. Regards, Michal signature.asc Description: Digital signature
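The pattern argued for in the message above (the first task of a group allocates a cookie, descendants inherit it over fork, and a sensitive descendant re-isolates itself by allocating a fresh cookie) can be illustrated with a toy userspace model. All names here are invented for illustration; the real mechanism is a prctl-based kernel interface:

```c
#include <assert.h>

/*
 * Toy model: each task carries a core-scheduling cookie, fork copies it,
 * and a task may allocate a fresh cookie to isolate itself. Two tasks
 * may share an SMT core only if their cookies match.
 */
struct toy_task {
	unsigned long cookie;	/* 0 = untagged, shares with anyone untagged */
};

static unsigned long next_cookie = 1;

static struct toy_task toy_fork(const struct toy_task *parent)
{
	return *parent;		/* the cookie is inherited */
}

static void toy_new_cookie(struct toy_task *t)
{
	t->cookie = next_cookie++;	/* fresh, so far unshared cookie */
}

static int toy_may_share_core(const struct toy_task *a,
			      const struct toy_task *b)
{
	return a->cookie == b->cookie;
}
```

In this model the cgroup file and the per-task prctl both reduce to "who called toy_new_cookie() and when", which is the equivalence being debated in the thread.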
Re: [PATCH 0/9] sched: Core scheduling interfaces
Hello. IIUC, the premise is that the tasks that have different cookies imply they would never share a core. On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote: > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It > is > however changed such that you can have nested tags. The for any given task, > the > first parent with a cookie is the effective one. The rationale is that this > way > you can delegate subtrees and still allow them some control over grouping. Given the existence of prctl and clone APIs, I don't see the reason to have a separate cgroup-bound interface too (as argued by Tejun). The potential speciality is the possibility to re-tag whole groups of processes at runtime (but the listed use cases [1] don't require that and it's not such a good idea given its asynchronicity). Also, I would find useful some more explanation how the hierarchical behavior is supposed to work. In my understanding the task is either allowed to request its own isolation or not. The cgroups could be used to restrict this privilege, however, that doesn't seem to be the case here. My two cents, Michal [1] https://lore.kernel.org/lkml/20200822030155.ga414...@google.com/ signature.asc Description: Digital signature
Re: [PATCH 06/11] cgroup: fix -Wzero-length-bounds warnings
On Tue, Mar 30, 2021 at 11:00:36AM +0200, Arnd Bergmann wrote: > Would it be possible to enclose most or all of kernel/cgroup/cgroup.c > in an #ifdef CGROUP_SUBSYS_COUNT block? Even without any controllers, there can still be named hierarchies (v1) or the default hierarchy (v2) (for instance) for process tracking purposes. So only parts of kernel/cgroup/cgroup.c could be ifdef'd. Beware that CGROUP_SUBSYS_COUNT is not known at the preprocessing stage (you could have a macro alternative though). > I didn't try that myself, but this might be a way to guarantee that > there cannot be any callers (it would cause a link error). Such a guarantee would be nicer, I agree. I tried a bit but abandoned it when I saw macros proliferate (which I found less readable than your current variant). But YMMV. Michal signature.asc Description: Digital signature
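For readers wondering why CGROUP_SUBSYS_COUNT is not known at the preprocessing stage: it is generated as an enumerator by an x-macro over the subsystem list, so the preprocessor never sees a numeric value (an #if on it would silently evaluate the unknown identifier as 0). A standalone imitation of that pattern, with a made-up subsystem list standing in for linux/cgroup_subsys.h:

```c
#include <assert.h>

/*
 * Imitation of the kernel's SUBSYS() x-macro pattern; in the kernel the
 * list comes from #including linux/cgroup_subsys.h with SUBSYS()
 * redefined, here it is a hardcoded stand-in.
 */
#define FAKE_SUBSYS_LIST(S) S(cpu) S(memory) S(pids)

#define SUBSYS(x) x ## _cgrp_id,
enum cgroup_subsys_id {
	FAKE_SUBSYS_LIST(SUBSYS)
	CGROUP_SUBSYS_COUNT,	/* an enum constant, invisible to #if */
};
#undef SUBSYS
```

Because CGROUP_SUBSYS_COUNT is an enumerator, only compile-time constructs such as BUILD_BUG_ON() or a separate list-counting macro can act on it, which is the "macro alternative" alluded to above.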
Re: [PATCH 06/11] cgroup: fix -Wzero-length-bounds warnings
On Mon, Mar 22, 2021 at 05:02:44PM +0100, Arnd Bergmann wrote: > I'm not sure what is expected to happen for such a configuration, > presumably these functions are never calls in that case. Yes, the functions you patched would only be called from subsystems or there should be no way to obtain a struct cgroup_subsys reference anyway (hence it's ok to always branch as if ss==NULL). I'd prefer a variant that wouldn't compile the affected codepaths when there are no subsystems registered, however, I couldn't come up with a way to do it without some preprocessor ugliness. Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH] mm: memcontrol: switch to rstat fix
On Mon, Mar 15, 2021 at 07:41:00PM -0400, Johannes Weiner wrote: > Switch to the atomic variant, cgroup_rstat_irqsafe(). Congratulations(?), the first use of cgroup_rstat_irqsafe(). Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [Patch v3 0/2] cgroup: New misc cgroup controller
On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma wrote: > I will add some more information in the cover letter of the next version. Thanks. > Each one coming up with their own interaction is a duplicate effort > when they all need similar thing. Could this be expressed as a new BPF hook (when allocating/freeing such a resource unit)? The decision could be made based on the configured limit or even some other predicate. (I saw this proposed already but I haven't seen some more reasoning whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown controllers, though it's still new user API.) > As per my understanding this is the only for way for loadable modules > (kvm-amd in this case) to access Kernel APIs. Let me know if there is a > better way to do it. I understood the symbols are exported for such modularized builds. However, making them non-GPL exposes them to any out-of-tree modules, although, the resource types are supposed to stay hardcoded in the misc controller. So my point was to make them EXPORT_SYMBOL_GPL to mark they're just a means of implementing the modularized builds and not an API. (But they'd remain API for out-of-tree GPL modules anyway, so take this reasoning of mine with a grain of salt.) Michal signature.asc Description: Digital signature
Re: [Patch v3 1/2] cgroup: sev: Add misc cgroup controller
On Fri, Mar 12, 2021 at 11:07:14AM -0800, Vipin Sharma wrote: > We should be fine without atomic64_t because we are using unsigned > long and not 64 bit explicitly. This will work on both 32 and 64 bit > machines. I see. > But I will add READ_ONCE and WRITE_ONCE because of potential chances of > load tearing and store tearing. > > Do you agree? Yes. > This was only here to avoid multiple reads of capacity and making sure > if condition and seq_print will see the same value. Aha. > Also, I was not aware of load and store tearing of properly aligned > and machine word size variables. I will add READ_ONCE and WRITE_ONCE > at other places. Yeah, although it's theoretical, I think it also serves well to annotate such unsynchronized accesses. Thanks, Michal signature.asc Description: Digital signature
Re: [Patch v3 1/2] cgroup: sev: Add misc cgroup controller
Hi Vipin. On Thu, Mar 04, 2021 at 03:19:45PM -0800, Vipin Sharma wrote: > arch/x86/kvm/svm/sev.c| 65 +- > arch/x86/kvm/svm/svm.h| 1 + > include/linux/cgroup_subsys.h | 4 + > include/linux/misc_cgroup.h | 130 +++ > init/Kconfig | 14 ++ > kernel/cgroup/Makefile| 1 + > kernel/cgroup/misc.c | 402 ++ Given the two-fold nature (SEV caller vs misc controller) of some of the remarks below, I think it makes sense to split this into two patches: a) generic controller implementation, b) hooking the controller into SEV ASIDs management. > +#ifndef CONFIG_KVM_AMD_SEV > +/* > + * When this config is not defined, SEV feature is not supported and APIs in > + * this file are not used but this file still gets compiled into the KVM AMD > + * module. > + * > + * We will not have MISC_CG_RES_SEV and MISC_CG_RES_SEV_ES entries in the > enum > + * misc_res_type {} defined in linux/misc_cgroup.h. BTW, was there any progress on conditioning the sev.c build on CONFIG_KVM_AMD_SEV? (So that the defines workaround isn't needed.) > static int sev_asid_new(struct kvm_sev_info *sev) > { > - int pos, min_asid, max_asid; > + int pos, min_asid, max_asid, ret; > bool retry = true; > + enum misc_res_type type; > + > + type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV; > + sev->misc_cg = get_current_misc_cg(); > + ret = misc_cg_try_charge(type, sev->misc_cg, 1); It may be safer to WARN_ON(sev->misc_cg) at this point (see below). > [...] > +e_uncharge: > + misc_cg_uncharge(type, sev->misc_cg, 1); > + put_misc_cg(sev->misc_cg); > + return ret; vvv > @@ -140,6 +171,10 @@ static void sev_asid_free(int asid) > } > > mutex_unlock(_bitmap_lock); > + > + type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV; > + misc_cg_uncharge(type, sev->misc_cg, 1); > + put_misc_cg(sev->misc_cg); It may be safer to set sev->misc_cg to NULL here. (IIUC, with the current asid_{new,free} calls it shouldn't matter, but why rely on it in the future.) > +++ b/kernel/cgroup/misc.c > [...] 
> +static void misc_cg_reduce_charge(enum misc_res_type type, struct misc_cg > *cg, > + unsigned long amount) misc_cg_cancel_charge seems to be a name more consistent with what we already have in the pids and memory controllers. > +static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf, > + size_t nbytes, loff_t off) > +{ > [...] > + > + if (!strcmp(MAX_STR, buf)) { > + max = ULONG_MAX; MAX_NUM for consistency with other places. > + } else { > + ret = kstrtoul(buf, 0, ); > + if (ret) > + return ret; > + } > + > + cg = css_misc(of_css(of)); > + > + if (misc_res_capacity[type]) > + cg->res[type].max = max; In theory, parallel writers can clash here, so making the limit an atomic type would resolve it. See also commit a713af394cf3 ("cgroup: pids: use atomic64_t for pids->limit"). > +static int misc_cg_current_show(struct seq_file *sf, void *v) > +{ > + int i; > + struct misc_cg *cg = css_misc(seq_css(sf)); > + > + for (i = 0; i < MISC_CG_RES_TYPES; i++) { > + if (misc_res_capacity[i]) Since there can be some residual charges after removing capacity (before draining), maybe a condition along the lines of if (misc_res_capacity[i] || atomic_long_read(>res[i].usage)) would be more informative for debugging. > +static int misc_cg_capacity_show(struct seq_file *sf, void *v) > +{ > + int i; > + unsigned long cap; > + > + for (i = 0; i < MISC_CG_RES_TYPES; i++) { > + cap = READ_ONCE(misc_res_capacity[i]); Why is READ_ONCE only here and not in other places that (actually) check against the set capacity value? Also, there should be a paired WRITE_ONCE in misc_cg_set_capacity(). Thanks, Michal signature.asc Description: Digital signature
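The atomic-limit remark in the review above (cf. the pids controller's atomic64_t limit) can be sketched with C11 atomics in userspace; the names are invented and this is not the actual misc controller code:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Keeping the limit in an atomic means two concurrent writers of
 * misc.max cannot tear or interleave the store, and readers always see
 * a value some writer actually wrote.
 */
struct toy_res {
	_Atomic unsigned long max;
};

static void toy_max_write(struct toy_res *res, unsigned long max)
{
	atomic_store(&res->max, max);	/* whole-word atomic store */
}

static unsigned long toy_max_read(struct toy_res *res)
{
	return atomic_load(&res->max);
}
```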
Re: [Patch v3 0/2] cgroup: New misc cgroup controller
Hello. On Sun, Mar 07, 2021 at 07:48:40AM -0500, Tejun Heo wrote: > Vipin, thank you very much for your persistence and patience. Yes, and thanks for taking my remarks into account. > Michal, as you've been reviewing the series, can you please take > another look and ack them if you don't find anything objectionable? Honestly, I'm still sitting on the fence about whether this needs a new controller and whether the miscontroller (:-p) is a good approach in the long term [1]. I admit, I didn't follow the past discussions completely, however, (Vipin) could the cover letter/commit messages briefly summarize why cgroups and a controller were chosen to implement restrictions of these resources, what the alternatives were and why they were rejected? In the previous discussion, I saw the reasoning for the list of the resources to be hardwired in the controller itself in order to get some scrutiny of possible changes. That makes sense to me. But with that, is it necessary to commit to the new controller API via EXPORT_SYMBOL? (I don't mean this as a licensing question but what the external API should be (if any).) Besides the generic remarks above, I'd still suggest some slight implementation changes, posted inline to the patch. Thanks, Michal [1] Currently, only one thing comes to my mind -- the delegation via cgroup.subtree_control. The miscontroller may possibly add further resources whose delegation granularity is bunched up under one entry. signature.asc Description: Digital signature
Re: [PATCH v2] sched/cpuacct: Fix charge cpuacct.usage_sys incorrently.
Hello. (Sorry for necroposting, found this upstream reference only now.) On Mon, Apr 20, 2020 at 03:04:53PM +0800, Muchun Song wrote: > /* Time spent by the tasks of the CPU accounting group executing in ... */ > @@ -339,7 +340,7 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime) > { > struct cpuacct *ca; > int index = CPUACCT_STAT_SYSTEM; > - struct pt_regs *regs = task_pt_regs(tsk); > + struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk); I've read the discussion in [1] but I don't think this approach is correct either (and I don't know what is better :-/). I only have a qualitative proof: host:~ # uname -r 5.10.16-1-default host:~ # systemd-run -p CPUAccounting=yes sh -c 'time sh -c "i=0 ; while [ \"\$i\" -lt 1 ] ; do i=\$((\$i+1)) ; cat /proc/slabinfo >/dev/null ; done" ; sleep inf' Running as unit: run-r101b9f53efcb4d2a9bfb65feb6f120ca.service host:~ # cat /sys/fs/cgroup/cpuacct/system.slice/run-r101b9f53efcb4d2a9bfb65feb6f120ca.service/cpuacct.usage{,_user,_sys} 16138535165 14332580468 1805954697 (See that sys/user ~ 0.1) host:~ # journalctl -u run-r101b9f53efcb4d2a9bfb65feb6f120ca.service -- Logs begin at Tue 2021-03-02 18:06:41 CET, end at Tue 2021-03-02 18:27:45 CET. -- Mar 02 18:27:29 host systemd[1]: Started /usr/bin/sh -c time sh -c "i=0 ; while [ \"\$i\" -lt 1 ] ; do i=\$((\$i+1)) ; cat /proc/slabinfo >/dev/null ; done" ; sleep inf. Mar 02 18:27:45 host sh[19117]: real0m15.543s Mar 02 18:27:45 host sh[19117]: user0m10.752s Mar 02 18:27:45 host sh[19117]: sys0m5.379s (See that sys/user ~ 0.5) host:~ # cat /sys/fs/cgroup/cpuacct/system.slice/run-r101b9f53efcb4d2a9bfb65feb6f120ca.service/cpuacct.stat user 415 system 1209 (See that sys/user ~ 3.0 :-o) The expectation is that significant amount of the loop is spent in kernel (dumping slabinfo). I can't tell which of the ratios fits the reality best but the cpuacct.usage_sys still seems too low. 
Michal [1] https://lore.kernel.org/lkml/20200416141833.50663-1-songmuc...@bytedance.com/ signature.asc Description: Digital signature
Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller
On Thu, Feb 25, 2021 at 11:28:46AM -0800, Vipin Sharma wrote: > My approach here is that it is the responsibility of the caller to: > 1. Check the return value and proceed accordingly. > 2. Ideally, let all of the usage be 0 before deactivating this resource >by setting capacity to 0 If the calling side can itself ensure that no new units of the resource are used from that moment on, then it can work this way -- but describe that in the misc_cg_set_capacity() comment. > Is the above change good? I think both alternatives would work. But the latter (as I see it now) would mandate a dependency on CONFIG_CGROUP or it'd have to duplicate similar logic itself. So maybe keeping the caller responsible explicitly is simpler from this POV. > Will there be any objection to extra information? IMO it's unnecessary (stating this just for consistency reasons), no strong opinion. Michal signature.asc Description: Digital signature
[RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance
leaf_cfs_rq_list should contain cfs_rqs that have runnable entities in them. When we're operating under a throttled hierarchy we always update the leaf_cfs_rq_list in order not to break the list_add_leaf_cfs_rq() invariant of adding complete branches. This caused troubles when an entity became runnable (enqueue_entity) under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix enqueue_task_fair() warning some more")). While that proved itself, this new change ignores updating leaf_cfs_rq_list when we're operating under the throttled hierarchy and defers the leaf_cfs_rq_list update to the point when the whole hierarchy is unthrottled (tg_unthrottle_up). The code is now simpler and leaf_cfs_rq_list contains only cfs_rqs that are truly runnable. Why is this RFC? - Primarily, I'm not sure I interpreted the purpose of leaf_cfs_rq_list right. The removal of throttled cfs_rqs from it would exclude them from the __update_blocked_fair() calculation and I can't see past it now. If it's a wrong assumption, I'd like this to help clarify what the proper definition of leaf_cfs_rq_list would be. - Additionally, I didn't check thoroughly for corner cases when se->on_rq => cfs_rq_of(se)->on_list wouldn't hold, so the patch certainly isn't finished. Signed-off-by: Michal Koutný --- kernel/sched/fair.c | 41 ++--- kernel/sched/sched.h | 4 +--- 2 files changed, 7 insertions(+), 38 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 04a3ce20da67..634ba6637824 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4250,10 +4250,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When bandwidth control is enabled, cfs might have been removed -* because of a parent been throttled but cfs->nr_running > 1. Try to -* add it unconditionnally. +* because of a parent been throttled. We'll add it later (with +* complete branch up to se->on_rq/cfs_eq->on_list) in +* tg_unthrottle_up() and unthrottle_cfs_rq(). 
*/ - if (cfs_rq->nr_running == 1 || cfs_bandwidth_used()) + if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq)) list_add_leaf_cfs_rq(cfs_rq); if (cfs_rq->nr_running == 1) @@ -4859,6 +4860,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) for_each_sched_entity(se) { if (se->on_rq) break; + // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless throttled_hierarchy) cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP); @@ -4896,17 +4898,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) add_nr_running(rq, task_delta); unthrottle_throttle: - /* -* The cfs_rq_throttled() breaks in the above iteration can result in -* incomplete leaf list maintenance, resulting in triggering the -* assertion below. -*/ - for_each_sched_entity(se) { - cfs_rq = cfs_rq_of(se); - - if (list_add_leaf_cfs_rq(cfs_rq)) - break; - } assert_list_leaf_cfs_rq(rq); @@ -5518,6 +5509,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { if (se->on_rq) break; + // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless throttled_hierarchy) cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, flags); @@ -5544,13 +5536,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) goto enqueue_throttle; - - /* -* One parent has been throttled and cfs_rq removed from the -* list. Add it back to not break the leaf list. -*/ - if (throttled_hierarchy(cfs_rq)) - list_add_leaf_cfs_rq(cfs_rq); } /* At this point se is NULL and we are at root level*/ @@ -5574,20 +5559,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_overutilized_status(rq); enqueue_throttle: - if (cfs_bandwidth_used()) { - /* -* When bandwidth control is enabled; the cfs_rq_throttled() -* breaks in the above iteration can result in incomplete -* leaf list maintenance, resulting in triggering the assertion -* below. 
-*/ - for_each_sched_entity(se) { - cfs_rq = cfs_rq_of(se); - - if (list_add_leaf_cfs_rq(cfs_rq)) - break; - } - }
Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller
On Wed, Feb 24, 2021 at 08:57:36PM -0800, Vipin Sharma wrote: > This function is meant for hot unplug functionality too. Then I'm wondering if the current form is sufficient, i.e. the generic controller can hardly implement preemption but possibly it should prevent any additional charges of the resource. (Or this can be implemented in the other subsystem and explained in the misc_cg_set_capacity() docs.) > Just to be on the same page are you talking about adding an events file > like in pids? Actually, I meant just the kernel log message. As it's the simpler part and even pid events have some inconsistencies wrt hierarchical reporting. > However, if I take reference at the first charge and remove reference at > last uncharge then I can keep the ref count in correct sync. I see now how it works. I still find it a bit complex. What about making misc_cg an input parameter and making it the caller's responsibility to keep a reference? (Perhaps with helpers for the most common case.) Thanks, Michal signature.asc Description: Digital signature
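The alternative suggested above (misc_cg as an input parameter, the caller keeps the reference) can be sketched as follows; a toy userspace model with invented names, just to show the ownership pattern:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy cgroup with an explicit refcount: the caller takes a reference
 * before charging and holds it for the lifetime of the charge, instead
 * of the controller taking a hidden reference on the first charge only.
 */
struct toy_cg {
	atomic_int refcnt;
	unsigned long usage, max;
};

static void toy_cg_get(struct toy_cg *cg)
{
	atomic_fetch_add(&cg->refcnt, 1);
}

static void toy_cg_put(struct toy_cg *cg)
{
	atomic_fetch_sub(&cg->refcnt, 1);	/* real code would free at zero */
}

/* cg is a plain input parameter, no reference games inside. */
static bool toy_cg_try_charge(struct toy_cg *cg, unsigned long amount)
{
	if (cg->usage + amount > cg->max)
		return false;
	cg->usage += amount;
	return true;
}
```

This keeps the lifetime rule visible at the call site (get, charge, ..., uncharge, put) rather than hidden inside the charge path.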
Re: [PATCH] mm: memcontrol: fix slub memory accounting
On Tue, Feb 23, 2021 at 05:24:23PM +0800, Muchun Song wrote: > mm/slab_common.c | 4 ++-- > mm/slub.c| 8 > 2 files changed, 6 insertions(+), 6 deletions(-) Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [RFC 1/2] cgroup: sev: Add misc cgroup controller
On Thu, Feb 18, 2021 at 11:55:48AM -0800, Vipin Sharma wrote: > --- a/arch/x86/kvm/svm/sev.c > +++ b/arch/x86/kvm/svm/sev.c > [...] > +#ifndef CONFIG_KVM_AMD_SEV > +/* > + * When this config is not defined, SEV feature is not supported and APIs in > + * this file are not used but this file still gets compiled into the KVM AMD > + * module. I'm not familiar with the layout of KVM/SEV compile targets but wouldn't it be simpler to exclude whole svm/sev.c when !CONFIG_KVM_AMD_SEV? > +++ b/kernel/cgroup/misc.c > [...] > +/** > + * misc_cg_set_capacity() - Set the capacity of the misc cgroup res. > + * @type: Type of the misc res. > + * @capacity: Supported capacity of the misc res on the host. > + * > + * If capacity is 0 then the charging a misc cgroup fails for that type. > + * > + * The caller must serialize invocations on the same resource. > + * > + * Context: Process context. > + * Return: > + * * %0 - Successfully registered the capacity. > + * * %-EINVAL - If @type is invalid. > + * * %-EBUSY - If current usage is more than the capacity. When is this function supposed to be called? At boot only or is this meant for some kind of hot unplug functionality too? > +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg **cg, > +unsigned int amount) > [...] > + new_usage = atomic_add_return(amount, >usage); > + if (new_usage > res->max || > + new_usage > misc_res_capacity[type]) { > + ret = -EBUSY; I'm not sure the user of this resource accounting will always be able to interpret EBUSY returned from depths of the subsystem. See what's done in pids controller in order to give some useful information about why operation failed. > + goto err_charge; > + } > + > + // First one to charge gets a reference. > + if (new_usage == amount) > + css_get(>css); 1) Use the /* comment */ style. 2) You pin the whole path from task_cg up to root (on the first charge). That's unnecessary since children reference their parents. 
Also why do you get the reference only for the first charger? While it may work, it seems too convoluted to me. It'd be worth documenting what the caller can expect wrt the ref count of the returned misc_cg. Thanks, Michal signature.asc Description: Digital signature
Re: [RFC 0/2] cgroup: New misc cgroup controller
Hello. On Thu, Feb 18, 2021 at 11:55:47AM -0800, Vipin Sharma wrote: > This patch is creating a new misc cgroup controller for allocation and > tracking of resources which are not abstract like other cgroup > controllers. Please don't refer to this as "allocation" anywhere, that has a specific meaning (see Resource Distribution Models in Documentation/admin-guide/cgroup-v2.rst). > This controller was initially proposed as encryption_id but after > the feedbacks, it is now changed to misc cgroup. > https://lore.kernel.org/lkml/20210108012846.4134815-2-vipi...@google.com/ Interesting generalization. I wonder what else could fit under this as well. (It resembles the pids controller on the surface.) > Please provide any feedback for this RFC or if it is good for > merging then I can send a patch for merging. A new controller is added and exposed with v1 attributes. I'm not convinced it is desirable to change the frozen v1 controllers' API (and thereby promote it as well). Beware, bikeshedding. The name is very non-descriptive, potentially suggesting catch-all semantics. It'd deserve further thought. My idea would be limit(s) or counter controller. My few cents here (and some more in reply to the patch), Michal signature.asc Description: Digital signature
Re: [PATCH 01/20] cgroup: Manual replacement of the deprecated strlcpy() with return values
Hello. On Mon, Feb 22, 2021 at 04:12:12PM +0100, Romain Perier wrote: > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2265,7 +2265,7 @@ int task_cgroup_path(struct task_struct *task, char > *buf, size_t buflen) Actually, this function isn't used at all. So I'd instead propose the patch below. -- >8 -- From 4f7e0b9c0412f60e0b0e8b7d1ef6eb2790dca567 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Tue, 23 Feb 2021 17:05:57 +0100 Subject: [PATCH] cgroup: Drop task_cgroup_path() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The function has no current users and it is a remnant from kdbus enthusiasm era 857a2beb09ab ("cgroup: implement task_cgroup_path_from_hierarchy()"). Drop it to eliminate unused code. Suggested-by: Romain Perier Signed-off-by: Michal Koutný --- include/linux/cgroup.h | 1 - kernel/cgroup/cgroup.c | 39 --- 2 files changed, 40 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 4f2f79de083e..e9c41b15fd4e 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -115,7 +115,6 @@ int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_rm_cftypes(struct cftype *cfts); void cgroup_file_notify(struct cgroup_file *cfile); -int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen); int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry); int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c80fe99f85ae..d75ffd461222 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2235,45 +2235,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen, } EXPORT_SYMBOL_GPL(cgroup_path_ns); -/** - * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy - * @task: target task - * @buf: the buffer to write the path into - * 
@buflen: the length of the buffer - * - * Determine @task's cgroup on the first (the one with the lowest non-zero - * hierarchy_id) cgroup hierarchy and copy its path into @buf. This - * function grabs cgroup_mutex and shouldn't be used inside locks used by - * cgroup controller callbacks. - * - * Return value is the same as kernfs_path(). - */ -int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen) -{ - struct cgroup_root *root; - struct cgroup *cgrp; - int hierarchy_id = 1; - int ret; - - mutex_lock(_mutex); - spin_lock_irq(_set_lock); - - root = idr_get_next(_hierarchy_idr, _id); - - if (root) { - cgrp = task_cgroup_from_root(task, root); - ret = cgroup_path_ns_locked(cgrp, buf, buflen, _cgroup_ns); - } else { - /* if no hierarchy exists, everyone is in "/" */ - ret = strlcpy(buf, "/", buflen); - } - - spin_unlock_irq(_set_lock); - mutex_unlock(_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(task_cgroup_path); - /** * cgroup_migrate_add_task - add a migration target task to a migration context * @task: target task -- 2.30.1 signature.asc Description: Digital signature
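As background to the strlcpy() deprecation driving Romain's series: strlcpy() returns the total length of the source string, so it must traverse the whole source even when it does not fit, and callers easily misuse the return value. strscpy() instead returns the number of bytes copied, or -E2BIG on truncation. A userspace approximation of those semantics (not the kernel's word-at-a-time implementation; toy_strscpy and TOY_E2BIG are made-up names):

```c
#include <assert.h>
#include <string.h>

#define TOY_E2BIG 7	/* stand-in for the kernel's E2BIG errno value */

/*
 * Approximation of strscpy(): copy at most size-1 bytes, always
 * NUL-terminate (for size > 0), and return the number of bytes copied
 * excluding the NUL, or -TOY_E2BIG when src did not fit.
 */
static long toy_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -TOY_E2BIG;
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (long)i : -TOY_E2BIG;
}
```

Unlike strlcpy(), the truncated case is unmistakable from the negative return value, which is why the kernel-wide replacement is worthwhile even where, as in task_cgroup_path() above, the simpler fix is deleting the caller.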
Re: [PATCH v3 6/8] mm: memcontrol: switch to rstat
On Tue, Feb 09, 2021 at 11:33:02AM -0500, Johannes Weiner wrote: > Rstat combines the best of both worlds: from the write side, it > cheaply maintains a queue of cgroups that have pending changes, so > that the read side can do selective tree aggregation. This way the > reported stats will always be precise and recent as can be, while the > aggregation can skip over potentially large numbers of idle cgroups. IIUC, on the unified hierarchy the rstat tree built from writer notifications is shared with all controllers. (Currently, it means memory updates are merged with base stats cpu time updates.) This is just for reference, to stress the current rstat design. > --- > include/linux/memcontrol.h | 67 ++- > mm/memcontrol.c| 223 +++-- > 2 files changed, 132 insertions(+), 158 deletions(-) Well done. Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1
On Wed, Feb 17, 2021 at 03:52:59PM -0500, Johannes Weiner wrote: > It's possible, but I don't think worth the trouble. You're right. I gave it a deeper look and what would be saved on data would be paid for in code complexity. > In this case, we're talking about a relatively small data structure > and the overhead is per mountpoint. IIUC, it is per cgroup of each mountpoint. But I still accept the argument above. Furthermore, this can be changed later. > The default root group has statically preallocated percpu data before > and after this patch. See cgroup.c: I stand corrected, the comment is still valid. Therefore, Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 8/8] kselftests: cgroup: update kmem test for new vmstat implementation
On Tue, Feb 09, 2021 at 11:33:04AM -0500, Johannes Weiner wrote: > --- > tools/testing/selftests/cgroup/test_kmem.c | 22 ++ > 1 file changed, 14 insertions(+), 8 deletions(-) Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo()
On Tue, Feb 09, 2021 at 11:32:58AM -0500, Johannes Weiner wrote: > include/linux/memcontrol.h | 8 +--- > mm/memcontrol.c| 21 +++-- > 2 files changed, 12 insertions(+), 17 deletions(-) Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 3/8] mm: memcontrol: privatize memcg_page_state query functions
On Tue, Feb 09, 2021 at 11:32:59AM -0500, Johannes Weiner wrote: > include/linux/memcontrol.h | 44 -- > mm/memcontrol.c| 32 +++ > 2 files changed, 32 insertions(+), 44 deletions(-) Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 1/8] mm: memcontrol: fix cpuhotplug statistics flushing
On Tue, Feb 09, 2021 at 11:32:57AM -0500, Johannes Weiner wrote: > mm/memcontrol.c | 35 +-- > 1 file changed, 21 insertions(+), 14 deletions(-) Reviewed-by: Michal Koutný signature.asc Description: Digital signature
Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1
Hello. On Tue, Feb 09, 2021 at 11:33:00AM -0500, Johannes Weiner wrote: > @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 > ss_mask) > if (ret) > goto destroy_root; > > - ret = rebind_subsystems(root, ss_mask); > + ret = cgroup_rstat_init(root_cgrp); Would it make sense to do cgroup_rstat_init() only if there's a subsys in ss_mask that makes use of rstat? (On legacy systems there could be individual hierarchy for each controller so the rstat space can be saved.) > @@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup > *parent, const char *name, > if (ret) > goto out_free_cgrp; > > - if (cgroup_on_dfl(parent)) { > - ret = cgroup_rstat_init(cgrp); > - if (ret) > - goto out_cancel_ref; > - } > + ret = cgroup_rstat_init(cgrp); And here do cgroup_rstat_init() only when parent has it. > @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void) > > for_each_possible_cpu(cpu) > raw_spin_lock_init(per_cpu_ptr(_rstat_cpu_lock, cpu)); > - > - BUG_ON(cgroup_rstat_init(_dfl_root.cgrp)); > } Regardless of the suggestion above, this removal obsoletes the comment cgroup_rstat_init: int cpu; -/* the root cgrp has rstat_cpu preallocated */ if (!cgrp->rstat_cpu) { cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu); Regards, Michal signature.asc Description: Digital signature
Re: [PATCH] mm: memcg/slab: optimize objcg stock draining
On Tue, Jan 05, 2021 at 08:22:39PM -0800, Roman Gushchin wrote: > Imran Khan reported a regression in hackbench results caused by the > commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects > instead of pages"). The regression is noticeable in the case of > a consequent allocation of several relatively large slab objects, > e.g. skb's. As soon as the amount of stocked bytes exceeds PAGE_SIZE, > drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads > to a number of atomic operations in page_counter_uncharge(). > > The corresponding call graph is below (provided by Imran Khan): > |__alloc_skb > || > ||__kmalloc_reserve.isra.61 > ||| > |||__kmalloc_node_track_caller > |||| > ||||slab_pre_alloc_hook.constprop.88 > ||| obj_cgroup_charge > ||||| > |||||__memcg_kmem_charge > |||||| > ||||||page_counter_try_charge > ||||| > |||||refill_obj_stock > |||||| > ||||||drain_obj_stock.isra.68 <--- draining old memcg > ||||||| > |||||||__memcg_kmem_uncharge > |||||||| > ||||||||page_counter_uncharge > ||||||||| > |||||||||page_counter_cancel > [...] > - page_counter_uncharge(>memory, nr_pages); > - if (do_memsw_account()) > - page_counter_uncharge(>memsw, nr_pages); > + refill_stock(memcg, nr_pages); I noticed that refill_stock(memcg,...) switches the local stock to memcg. In this particular call chain, the uncharged memcg is the old memcg of stock->cached_objcg. The refill_stock() then may switch stock->cached to the old memcg too. If the patch leads to better performance, then the switch probably doesn't happen at this moment (and I guess stock->cached_objcg and stock->cached can be independent to some extent, so the old memcg in one needn't be the old in the latter). In conclusion Reviewed-by: Michal Koutný Michal signature.asc Description: Digital signature
Re: [PATCH v3] cgroup-v1: add disabled controller check in cgroup1_parse_param()
On Fri, Jan 15, 2021 at 05:37:17PM +0800, Chen Zhou wrote:
> [...]
>  kernel/cgroup/cgroup-v1.c | 3 +++
>  1 file changed, 3 insertions(+)
Reviewed-by: Michal Koutný
Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()
On Fri, Jan 15, 2021 at 09:55:43AM +0800, chenzhou wrote:
> Yeah, this will select all enabled controllers, but which doesn't the
> behavior we want.
I see what the issue is now.

> See above. Just the mount behavior isn't what we what.
I agree this is a bug and I find your approach correct

Reviewed-by: Michal Koutný

> The behavior was changed since commit f5dfb5315d34 ("cgroup: take
> options parsing into ->parse_monolithic()"), will add this as Fixes.
Thanks.

Michal
Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()
On Thu, Jan 14, 2021 at 10:08:19PM +0800, chenzhou wrote:
> In this case, at the beginning of function check_cgroupfs_options(), the mask
> ctx->subsys_mask will be 0. And if we mount without 'none' and 'name='
> options, then in check_cgroupfs_options(), the flag ctx->all_ss will be set,
> that is, select all the subsystems.
But even then, the line
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup-v1.c?h=v5.11-rc3#n1012
would select only 'enabled' controllers, wouldn't it?

It's possible I missed something but if this means that cgroup_no_v1=
doesn't live up to its expectations, I'd suggest adding a proper Fixes: tag
to the patch.

Thanks,
Michal
Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()
Hello Chen.

On Fri, Dec 18, 2020 at 02:17:55PM +0800, Chen Zhou wrote:
> When mounting a cgroup hierarchy with disabled controller in cgroup v1,
> all available controllers will be attached.
Not sure if I understand the situation -- have you observed a v1 controller
attached to a hierarchy while specifying the cgroup_no_v1= kernel cmdline
arg?

AFAICS, the disabled controllers are honored thanks to
check_cgroupfs_options().

Michal
Re: [PATCH] cgroup: Remove unnecessary call to strstrip()
Hello.

On Sun, Jan 03, 2021 at 02:50:01AM +0000, Hao Lee wrote:
> The string buf will be stripped in cgroup_procs_write_start() before it
> is converted to int, so remove this unnecessary call to strstrip().
Good catch, Hao. Perhaps the code can then be simplified a bit

-- >8 --
From: Michal Koutný
Date: Thu, 14 Jan 2021 13:23:39 +0100
Subject: [PATCH] cgroup: cgroup.{procs,threads} factor out common parts

The functions cgroup_threads_start and cgroup_procs_start are almost
identical. In order to reduce duplication, factor out the common code in
similar fashion we already do for other threadgroup/task functions. No
functional changes are intended.

Suggested-by: Hao Lee
Signed-off-by: Michal Koutný
---
 kernel/cgroup/cgroup.c | 55 +++---
 1 file changed, 14 insertions(+), 41 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 613845769103c..f6279df507f4b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4726,8 +4726,8 @@ static int cgroup_attach_permissions(struct cgroup *src_cgrp,
 	return ret;
 }
 
-static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
+				    bool threadgroup)
 {
 	struct cgroup *src_cgrp, *dst_cgrp;
 	struct task_struct *task;
@@ -4738,7 +4738,7 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	if (!dst_cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, true, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &locked);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -4748,19 +4748,26 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
 	spin_unlock_irq(&css_set_lock);
 
+	/* process and thread migrations follow same delegation rule */
 	ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
-					of->file->f_path.dentry->d_sb, true);
+					of->file->f_path.dentry->d_sb, threadgroup);
 	if (ret)
 		goto out_finish;
 
-	ret = cgroup_attach_task(dst_cgrp, task, true);
+	ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
 
 out_finish:
 	cgroup_procs_write_finish(task, locked);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-	return ret ?: nbytes;
+	return ret;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup_procs_write(of, buf, true) ?: nbytes;
 }
 
 static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
@@ -4771,41 +4778,7 @@ static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
 static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
-	struct cgroup *src_cgrp, *dst_cgrp;
-	struct task_struct *task;
-	ssize_t ret;
-	bool locked;
-
-	buf = strstrip(buf);
-
-	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
-	if (!dst_cgrp)
-		return -ENODEV;
-
-	task = cgroup_procs_write_start(buf, false, &locked);
-	ret = PTR_ERR_OR_ZERO(task);
-	if (ret)
-		goto out_unlock;
-
-	/* find the source cgroup */
-	spin_lock_irq(&css_set_lock);
-	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-	spin_unlock_irq(&css_set_lock);
-
-	/* thread migrations follow the cgroup.procs delegation rule */
-	ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
-					of->file->f_path.dentry->d_sb, false);
-	if (ret)
-		goto out_finish;
-
-	ret = cgroup_attach_task(dst_cgrp, task, false);
-
-out_finish:
-	cgroup_procs_write_finish(task, locked);
-out_unlock:
-	cgroup_kn_unlock(of->kn);
-
-	return ret ?: nbytes;
+	return __cgroup_procs_write(of, buf, false) ?: nbytes;
 }
 
 /* cgroup core interface files for the default hierarchy */
-- 
2.29.2
Re: [PATCH] mm: memcontrol: prevent starvation when writing memory.high
On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> -	reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -						 GFP_KERNEL, true);
> +	try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +				     GFP_KERNEL, true);
Although I was also initially confused by throwing the 'reclaimed' info
away, the patch makes sense to me given the reasoning. It is

Reviewed-by: Michal Koutný

As for the discussed unsuccessful retries, I'd keep it a separate change.

Regards,
Michal
Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
Hi.

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when a singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely free'd.)

> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the necessary
counters in both directions using the statistics (or some complementarity,
without fine grained counters removal).

I didn't check all the charging/uncharging places, these are just my 2
cents on the issue.

Michal
Re: [PATCH v2] mm: memcg: link page counters to root if use_hierarchy is false
Hi.

On Mon, Oct 26, 2020 at 04:13:26PM -0700, Roman Gushchin wrote:
> Please note, that in the non-hierarchical mode all objcgs are always
> reparented to the root memory cgroup, even if the hierarchy has more
> than 1 level. This patch doesn't change it.
>
> The patch also doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
I agree with the patch and you can add

Reviewed-by: Michal Koutný

However, it effectively switches any users of root.use_hierarchy=0 (if
there are any, watching the counters of root memcg) into
root.use_hierarchy=1. So I'd show them the warning even with a single level
of cgroups, i.e. add this hunk

@@ -5356,12 +5356,11 @@
 		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
 		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
 		/*
-		 * Deeper hierachy with use_hierarchy == false doesn't make
+		 * Hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
 		 * unfortunate state in our controller.
 		 */
-		if (parent != root_mem_cgroup)
-			memory_cgrp_subsys.broken_hierarchy = true;
+		memory_cgrp_subsys.broken_hierarchy = true;
 	}
 
 	/* The following stuff does not apply to the root */

What do you think?

Michal
Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
Hi.

On Tue, Oct 20, 2020 at 06:52:08AM +0100, Richard Palethorpe wrote:
> I don't think that is relevant as we get the memcg from objcg->memcg
> which is set during reparenting. I suppose however, we can determine if
> the objcg was reparented by inspecting memcg->objcg.
+1

> If we just check use_hierarchy then objects directly charged to the
> memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
> better to check if it was reparented and if use_hierarchy=0.
I think (I had to make a table) the yielded condition would be:

	if ((memcg->use_hierarchy && reparented) ||
	    (!mem_cgroup_is_root(memcg) && !reparented))
		__memcg_kmem_uncharge(memcg, nr_pages);

(I admit it's not very readable.)

Michal
Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
Hi.

On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/

> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...

> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.

> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root counters
and keeping the traversal as is.

> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)

Michal
Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe wrote:
> I'm don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was originally
(not)charged in the root memcg or it was only reparented to it. (I
originally considered only the genuine root objcgs.)

In this light, homogeneous charging to the root memcg looks really simpler
but I wonder what other surprises it brings about.

Michal
Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hierarchy=0 (which is the case here as Richard wrote). The
reparenting is hence somewhat incompatible with
new_parent.use_hierarchy=0 :-/

> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding the root memcg
from charging is that users are spared from the charging overhead when the
memcg tree is not created. (Actually, I thought that was the reason for
this exception.)

Michal
Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
Hello.

On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up
the hierarchy, including the root memcg, so if they're later uncharged in
root after reparenting, it should still break even. (Or did I miss some
stock imbalance?)

(But the patch seems justifiable to me as objects (not)charged directly to
the root memcg may be incorrectly uncharged.)

Thanks,
Michal
Re: [PATCH v4 4/4] mm: convert page kmemcg type to a page memcg flag
On Thu, Oct 01, 2020 at 01:27:13PM -0400, Johannes Weiner wrote:
> The activation code is the only path where page migration is not
> excluded. Because unlike with page state statistics, we don't really
> mind a race when counting an activation event.
Thanks for the explanation. I see why the accessor trio is justified.

> I do think there is a bug, though: mem_cgroup_move_account() should
> use WRITE_ONCE() on page->mem_cgroup.
If this were a bug, wouldn't the proper approach be an
rcu_assign_pointer()/rcu_dereference() pair?

Michal
Re: [PATCH v4 4/4] mm: convert page kmemcg type to a page memcg flag
Hi.

On Wed, Sep 30, 2020 at 05:27:10PM -0700, Roman Gushchin wrote:
> @@ -369,8 +371,12 @@ enum page_memcg_data_flags {
>   */
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
> +	unsigned long memcg_data = page->memcg_data;
> +
>  	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	return (struct mem_cgroup *)page->memcg_data;
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> +	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
Shouldn't this change go also into page_memcg_rcu()? (I don't think the
current single user (workingset_activation()) would pass a non-slab kernel
page, but for consistency sake.)

Alternatively, I'm thinking why page_memcg_rcu() is (in its single use) a
separate function from page_memcg() (cross memcg page migration?).

Regards,
Michal
Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin wrote:
> If we'll limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create trashing in init.slice, followed by OOM.
> I struggle to see how it makes the life of a user better?
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources and help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live memcgs.

When the issue is the dying memcgs (c), they were effectively released by
their creator but are pinned by whatever remained after their life (LRU
pages, slab->obj_cgroups). Since these pins were created _from within_ such
a child (c), they're most readily removable by reclaiming (hierarchically)
close to c. (It'd be achievable by limiting the lowest common ancestor of
the manager and its product (typically root) but that is more clumsy and
less effective.) This is the reasoning that justifies the remote charge.

Thanks!
Michal
Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most common
case).

> 2) OOMing or failing some random memory allocations is a bad way
>    to "communicate" a memory shortage to the daemon.
>    What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...

>    (including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?

>    OOMing the daemon or returning -ENOMEM to some random syscalls
>    will not help us to reach the goal and likely will bring a bad
>    experience to a user.
If we reach the situation when memory for cgroup operations is tight, it'll
disappoint the user either way. My premise is that a running workload is
more valuable than the accompanying manager.

> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various kernel
threads doing deferred work. Here it's promoted to user processes managing
cgroups.

> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
>
> ` root or delegated root
>     ` manager-parent-cgroup-with-a-limit
>       ` manager-cgroup (systemd, docker, ...)
>     ` [aggregation group(s)]
>       ` job-group-1
>       ` ...
>       ` job-group-n
If the charge goes to the parent of the created cgroup (job-cgroup-i here),
then the layer adds nothing. Am I missing something?

> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate the controllers on the unified hierarchy, which is fine
IMO.) Then the order of enabling controllers on a subtree (e.g. cpu,memory
vs memory,cpu) by the manager would yield different charging. This seems
wrong^W confusing to me.
Thanks,
Michal
Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface
Hi Shakeel.

On Tue, Jul 07, 2020 at 10:02:50AM -0700, Shakeel Butt wrote:
> > Well, I was talkingg about memory.low. It is not meant only to protect
> > from the global reclaim. It can be used for balancing memory reclaim
> > from _any_ external memory pressure source. So it is somehow related to
> > the usecase you have mentioned.
>
> For the uswapd use-case, I am not concerned about the external memory
> pressure source but the application hitting its own memory.high limit
> and getting throttled.
FTR, you can transform own memory.high into "external" pressure with a
hierarchy such as

	limit-group                 memory.high=N+margin memory.low=0
	`- latency-sensitive-group  memory.low=N
	`- regular-group            memory.low=0

Would that ensure the latency targets?

Michal
Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process constraint,
in the root too)

	` root or delegated root
	  ` manager-cgroup (systemd, docker, ...)
	  ` [aggregation group(s)]
	    ` job-group-1
	    ` ...
	    ` job-group-n

> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
No dispute about this in either approach.

> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation. Or
more generally, any container/cgroup manager should be charged. Consider a
leak when it wouldn't remove spent cgroups, IMO the effect (throttling,
reclaim) should be exercised on such a culprit. I don't think the existing
workload (job-group-i above) should unnecessarily suffer when only the
manager is acting up.

Is that different from your idea?

> Right, but it's quite unusual for tasks from one cgroup to create
> sub-cgroups in completely different cgroup. In this particular case there
> are tons of other ways how a task from C00 can hurt C1.
I agree with that.

If I haven't overlooked anything, this should be the first case when
cgroup-related structures are accounted (please correct me). So this is
setting a precedent, if others show useful to be accounted in the future
too. I'm thinking about cpu_cgroup_css_alloc() that can also allocate a lot
(with a big CPU count). The current approach would lead to situations where
matching cpu and memory csses needn't exist and that would need special
handling.

> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant. Roman? Or someone
> > else?
Despite my concerns, I don't think this is fundamental and can't be changed
later, so it doesn't prevent the inclusion in 5.9 RC1.

Regards,
Michal
Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
Hello.

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it. It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's a significant amount.

> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?

Should the charge go to the cgroup _that is creating_ the new memcg?

One reason is that there are the throttling mechanisms for memory limits
and those are better exercised when the actor and its memory artefact are
the same cgroup, aren't they?

The second reason is based on the example from Delegation Containment
(Documentation/admin-guide/cgroup-v2.rst)

> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
>
>   ~~~~~~~~~~~~~ - C0 - C00
>   ~ cgroup    ~      \ C01
>   ~ hierarchy ~
>   ~~~~~~~~~~~~~ - C1 - C10

Thanks to permissions a task running in C0 creating a cgroup in C1 would
deplete C1's supply victimizing tasks inside C1.

Thanks,
Michal
[PATCH 3/3] docs: cgroup: No special handling of unpopulated memcgs
The current kernel doesn't handle unpopulated cgroups in any special way
regarding reclaim protection. Furthermore, this wasn't the case even when
this was introduced in bf8d5d52ffe89 ("memcg: introduce memory.min").

Drop the incorrect documentation. (Implementation taking into account the
inner-node constraint may be added later.)

Signed-off-by: Michal Koutný
---
 Documentation/admin-guide/cgroup-v2.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 47f9f056e66f..3d62922c4499 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1123,9 +1123,6 @@ PAGE_SIZE multiple when read back.
 	Putting more memory than generally available under this
 	protection is discouraged and may lead to constant OOMs.
 
-	If a memory cgroup is not populated with processes,
-	its memory.min is ignored.
-
   memory.low
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "0".
-- 
2.27.0
[PATCH 0/3] Memory reclaim documentation fixes
Hi, I think the reclaim target is a concept that is not just an implementation detail and hence it should be documented how it applies to protection configuration (the first patch). Second patch is a "best practice" bit of information, it may be squashed with the first one. The last patch just makes docs indefinite until the idea is implemented. Michal Koutný (3): docs: cgroup: Explain reclaim protection target docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: No special handling of unpopulated memcgs Documentation/admin-guide/cgroup-v2.rst | 31 - 1 file changed, 25 insertions(+), 6 deletions(-) -- 2.27.0
[PATCH 1/3] docs: cgroup: Explain reclaim protection target
Signed-off-by: Michal Koutný --- Documentation/admin-guide/cgroup-v2.rst | 24 +--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index d09471aa7443..94bdff4f9e09 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -47,7 +47,8 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst
[PATCH 2/3] docs: cgroup: Note about sibling relative reclaim protection
Signed-off-by: Michal Koutný --- Documentation/admin-guide/cgroup-v2.rst | 4 1 file changed, 4 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 94bdff4f9e09..47f9f056e66f 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1467,6 +1467,10 @@ targets ancestors of A, the effective protection of B is capped by the protection value configured for A (and any other intermediate ancestors between A and the target). +To express indifference about relative sibling protection, it is suggested to +use memory_recursiveprot. Configuring all descendants of a parent with finite +protection to "max" works but it may unnecessarily skew memory.events:low +field. Memory Ownership -- 2.27.0
[PATCH] /proc/PID/smaps: Consistent whitespace output format
The keys in smaps output are padded to fixed width with spaces. All except for THPeligible that uses tabs (only since commit c06306696f83 ("mm: thp: fix false negative of shmem vma's THP eligibility")). Unify the output formatting to save time debugging some naïve parsers. (Part of the unification is also aligning FilePmdMapped with others.) Signed-off-by: Michal Koutný --- fs/proc/task_mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index dbda4499a859..5066b0251ed8 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -786,7 +786,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); - SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); + SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ", mss->private_hugetlb >> 10, 7); @@ -816,7 +816,7 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, , false); - seq_printf(m, "THPeligible: %d\n", + seq_printf(m, "THPeligible:%d\n", transparent_hugepage_enabled(vma)); if (arch_pkeys_enabled()) -- 2.27.0
Re: [PATCH v4 2/3] nsproxy: attach to namespaces via pidfds
On Wed, Jun 24, 2020 at 01:54:56PM +0200, Christian Brauner wrote:
> Yep, I already have a fix for this in my tree based on a previous
> report from LTP.
Perfect. (Sorry for the noise then.)

Thanks,
Michal
Re: [PATCH v4 2/3] nsproxy: attach to namespaces via pidfds
Hi.

On Tue, May 05, 2020 at 04:04:31PM +0200, Christian Brauner wrote:
> -SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +SYSCALL_DEFINE2(setns, int, fd, int, flags)
> [...]
> -	file = proc_ns_fget(fd);
> -	if (IS_ERR(file))
> -		return PTR_ERR(file);
> +	int err = 0;
>
> -	err = -EINVAL;
> -	ns = get_proc_ns(file_inode(file));
> -	if (nstype && (ns->ops->type != nstype))
> +	file = fget(fd);
> +	if (!file)
> +		return -EBADF;
> +
> +	if (proc_ns_file(file)) {
> +		ns = get_proc_ns(file_inode(file));
> +		if (flags && (ns->ops->type != flags))
> +			err = -EINVAL;
> +		flags = ns->ops->type;
> +	} else if (pidfd_pid(file)) {
> +		err = check_setns_flags(flags);
> +	} else {
> +		err = -EBADF;
> +	}
> +	if (err)
> 		goto out;
>
> -	err = prepare_nsset(ns->ops->type, &nsset);
> +	err = prepare_nsset(flags, &nsset);
> 	if (err)
> 		goto out;
This modification changed the returned error when a valid file descriptor
is passed but it doesn't represent a namespace (nor pidfd). The error is
now EBADF although originally and per man page it was/should be EINVAL.

A change like below would restore it, however, I see it may be less
consistent with other pidfd calls(?); in that case I'd suggest updating the
man page to capture this.

--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -531,7 +531,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	} else if (!IS_ERR(pidfd_pid(file))) {
 		err = check_setns_flags(flags);
 	} else {
-		err = -EBADF;
+		err = -EINVAL;
 	}
 	if (err)
 		goto out;

I noticed this breaks systemd self tests [1].

Regards,
Michal

[1] https://github.com/systemd/systemd/blob/a1ba8c5b71164665ccb53c9cec384e5eef7d3689/src/test/test-seccomp.c#L246
Re: [PATCH 3/6] mm, memcg: Prevent memory.low load/store tearing
Hello.

I see a suspicious asymmetry in the current mainline:

> WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage,
> 		READ_ONCE(memcg->memory.min),
> 		READ_ONCE(parent->memory.emin),
> 		atomic_long_read(&parent->memory.children_min_usage)));
>
> WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
> 		memcg->memory.low, READ_ONCE(parent->memory.elow),
> 		atomic_long_read(&parent->memory.children_low_usage)));

On Thu, Mar 12, 2020 at 05:33:01PM +0000, Chris Down wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aca2964ea494..c85a304fa4a1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6262,7 +6262,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  		return MEMCG_PROT_NONE;
>  
>  	emin = memcg->memory.min;
> -	elow = memcg->memory.low;
> +	elow = READ_ONCE(memcg->memory.low);
>  
>  	parent = parent_mem_cgroup(memcg);
>  	/* No parent means a non-hierarchical mode on v1 memcg */
> @@ -6291,7 +6291,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  	if (elow && parent_elow) {
>  		unsigned long low_usage, siblings_low_usage;
>  
> -		low_usage = min(usage, memcg->memory.low);
> +		low_usage = min(usage, READ_ONCE(memcg->memory.low));
>  		siblings_low_usage = atomic_long_read(
>  			&parent->memory.children_low_usage);

Is it possible that these hunks were lost during rebase/merge? IMHO it
should apply as:

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6428,7 +6428,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 			atomic_long_read(&parent->memory.children_min_usage)));
 
 	WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
-			memcg->memory.low, READ_ONCE(parent->memory.elow),
+			READ_ONCE(memcg->memory.low),
+			READ_ONCE(parent->memory.elow),
 			atomic_long_read(&parent->memory.children_low_usage)));
 
 out:

Michal
Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
On Fri, Oct 04, 2019 at 03:11:04PM -0700, Roman Gushchin wrote: > An inode which is getting dirty for the first time is associated > with the wb structure (look at __inode_attach_wb()). It can later > be switched to another wb under some conditions (e.g. some other > cgroup is writing a lot of data to the same inode), but generally > stays associated up to the end of life of the inode structure. What about dissociating the wb structure from the charged cgroup after the particular writeback finished? (I understand from your description that wb structure outlives the dirtier and is kept due to other inode (read) users, not sure if that's correct assumption.) Michal signature.asc Description: Digital signature
[PATCH 3/5] selftests: cgroup: Simplify task self migration
Simplify task migration by making it oblivious to the task's PID during
migration. This allows individual threads to be migrated easily as well.
This brings no functional change and prepares the ground for
thread-granularity migration tests.

Signed-off-by: Michal Koutný
---
 tools/testing/selftests/cgroup/cgroup_util.c  | 16 +++-
 tools/testing/selftests/cgroup/cgroup_util.h  |  4 +++-
 tools/testing/selftests/cgroup/test_freezer.c |  2 +-
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index bdb69599c4bd..f6573eac1365 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -282,10 +282,12 @@ int cg_enter(const char *cgroup, int pid)
 
 int cg_enter_current(const char *cgroup)
 {
-	char pidbuf[64];
+	return cg_write(cgroup, "cgroup.procs", "0");
+}
 
-	snprintf(pidbuf, sizeof(pidbuf), "%d", getpid());
-	return cg_write(cgroup, "cgroup.procs", pidbuf);
+int cg_enter_current_thread(const char *cgroup)
+{
+	return cg_write(cgroup, "cgroup.threads", "0");
 }
 
 int cg_run(const char *cgroup,
@@ -410,11 +412,15 @@ int set_oom_adj_score(int pid, int score)
 	return 0;
 }
 
-char proc_read_text(int pid, const char *item, char *buf, size_t size)
+ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size)
 {
 	char path[PATH_MAX];
 
-	snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
+	if (!pid)
+		snprintf(path, sizeof(path), "/proc/%s/%s",
+			 thread ? "thread-self" : "self", item);
+	else
+		snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
 
 	return read_text(path, buf, size);
 }
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index c72f28046bfa..27ff21d82af1 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <stdbool.h>
 #include <stdlib.h>
 
 #define PAGE_SIZE 4096
@@ -35,6 +36,7 @@ extern int cg_run(const char *cgroup,
 		  void *arg);
 extern int cg_enter(const char *cgroup, int pid);
 extern int cg_enter_current(const char *cgroup);
+extern int cg_enter_current_thread(const char *cgroup);
 extern int cg_run_nowait(const char *cgroup,
 			 int (*fn)(const char *cgroup, void *arg),
 			 void *arg);
@@ -45,4 +47,4 @@ extern int is_swap_enabled(void);
 extern int set_oom_adj_score(int pid, int score);
 extern int cg_wait_for_proc_count(const char *cgroup, int count);
 extern int cg_killall(const char *cgroup);
-extern char proc_read_text(int pid, const char *item, char *buf, size_t size);
+extern ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
index 8219a30853d2..5896e35168cd 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -648,7 +648,7 @@ static int proc_check_stopped(int pid)
 	char buf[PAGE_SIZE];
 	int len;
 
-	len = proc_read_text(pid, "stat", buf, sizeof(buf));
+	len = proc_read_text(pid, 0, "stat", buf, sizeof(buf));
 	if (len == -1) {
 		debug("Can't get %d stat\n", pid);
 		return -1;
-- 
2.21.0
[PATCH 4/5] selftests: cgroup: Add task migration tests
Add two new tests that verify that thread and threadgroup migrations
work as expected.

Signed-off-by: Michal Koutný
---
 tools/testing/selftests/cgroup/Makefile      |   2 +-
 tools/testing/selftests/cgroup/cgroup_util.c |  26
 tools/testing/selftests/cgroup/cgroup_util.h |   2 +
 tools/testing/selftests/cgroup/test_core.c   | 146 +++
 4 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 8d369b6a2069..1c9179400be0 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-CFLAGS += -Wall
+CFLAGS += -Wall -pthread
 
 all:
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index f6573eac1365..8f7131dcf1ff 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -158,6 +158,22 @@ long cg_read_key_long(const char *cgroup, const char *control, const char *key)
 	return atol(ptr + strlen(key));
 }
 
+long cg_read_lc(const char *cgroup, const char *control)
+{
+	char buf[PAGE_SIZE];
+	const char delim[] = "\n";
+	char *line;
+	long cnt = 0;
+
+	if (cg_read(cgroup, control, buf, sizeof(buf)))
+		return -1;
+
+	for (line = strtok(buf, delim); line; line = strtok(NULL, delim))
+		cnt++;
+
+	return cnt;
+}
+
 int cg_write(const char *cgroup, const char *control, char *buf)
 {
 	char path[PATH_MAX];
@@ -424,3 +440,13 @@ ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size)
 
 	return read_text(path, buf, size);
 }
+
+int proc_read_strstr(int pid, bool thread, const char *item, const char *needle)
+{
+	char buf[PAGE_SIZE];
+
+	if (proc_read_text(pid, thread, item, buf, sizeof(buf)) < 0)
+		return -1;
+
+	return strstr(buf, needle) ? 0 : -1;
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 27ff21d82af1..49c54fbdb229 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -30,6 +30,7 @@ extern int cg_read_strstr(const char *cgroup, const char *control,
 			  const char *needle);
 extern long cg_read_long(const char *cgroup, const char *control);
 long cg_read_key_long(const char *cgroup, const char *control, const char *key);
+extern long cg_read_lc(const char *cgroup, const char *control);
 extern int cg_write(const char *cgroup, const char *control, char *buf);
 extern int cg_run(const char *cgroup,
 		  int (*fn)(const char *cgroup, void *arg),
@@ -48,3 +49,4 @@ extern int set_oom_adj_score(int pid, int score);
 extern int cg_wait_for_proc_count(const char *cgroup, int count);
 extern int cg_killall(const char *cgroup);
 extern ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size);
+extern int proc_read_strstr(int pid, bool thread, const char *item, const char *needle);
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index 79053a4f4783..c5ca669feb2b 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -5,6 +5,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -354,6 +357,147 @@ static int test_cgcore_internal_process_constraint(const char *root)
 	return ret;
 }
 
+static void *dummy_thread_fn(void *arg)
+{
+	return (void *)(size_t)pause();
+}
+
+/*
+ * Test threadgroup migration.
+ * All threads of a process are migrated together.
+ */
+static int test_cgcore_proc_migration(const char *root)
+{
+	int ret = KSFT_FAIL;
+	int t, c_threads, n_threads = 13;
+	char *src = NULL, *dst = NULL;
+	pthread_t threads[n_threads];
+
+	src = cg_name(root, "cg_src");
+	dst = cg_name(root, "cg_dst");
+	if (!src || !dst)
+		goto cleanup;
+
+	if (cg_create(src))
+		goto cleanup;
+	if (cg_create(dst))
+		goto cleanup;
+
+	if (cg_enter_current(src))
+		goto cleanup;
+
+	for (c_threads = 0; c_threads < n_threads; ++c_threads) {
+		if (pthread_create(&threads[c_threads], NULL, dummy_thread_fn, NULL))
+			goto cleanup;
+	}
+
+	cg_enter_current(dst);
+	if (cg_read_lc(dst, "cgroup.threads") != n_threads + 1)
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (t = 0; t < c_threads; ++t) {
+		pthread_cancel(threads[t]);
+	}
+
+	for (t = 0; t < c_threads; ++t) {
+		pthread_join(threads[
[PATCH 5/5] selftests: cgroup: Run test_core under interfering stress
test_core tests various cgroup creation/removal and task migration
paths. Run the tests repeatedly with interfering noise (for lockdep
checks). Currently, the implemented noises are forking and switching a
subsystem between enabled and disabled.

Signed-off-by: Michal Koutný
---
 tools/testing/selftests/cgroup/Makefile       |   2 +
 tools/testing/selftests/cgroup/test_stress.sh |   4 +
 tools/testing/selftests/cgroup/with_stress.sh | 101 ++
 3 files changed, 107 insertions(+)
 create mode 100755 tools/testing/selftests/cgroup/test_stress.sh
 create mode 100755 tools/testing/selftests/cgroup/with_stress.sh

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 1c9179400be0..66aafe1f5746 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -3,6 +3,8 @@ CFLAGS += -Wall -pthread
 
 all:
 
+TEST_FILES := with_stress.sh
+TEST_PROGS := test_stress.sh
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
diff --git a/tools/testing/selftests/cgroup/test_stress.sh b/tools/testing/selftests/cgroup/test_stress.sh
new file mode 100755
index ..15d9d5896394
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_stress.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+./with_stress.sh -s subsys -s fork ./test_core
diff --git a/tools/testing/selftests/cgroup/with_stress.sh b/tools/testing/selftests/cgroup/with_stress.sh
new file mode 100755
index ..e28c35008f5b
--- /dev/null
+++ b/tools/testing/selftests/cgroup/with_stress.sh
@@ -0,0 +1,101 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+stress_fork()
+{
+	while true ; do
+		/usr/bin/true
+		sleep 0.01
+	done
+}
+
+stress_subsys()
+{
+	local verb=+
+	while true ; do
+		echo $verb$subsys_ctrl >$sysfs/cgroup.subtree_control
+		[ $verb = "+" ] && verb=- || verb=+
+		# incommensurable period with other stresses
+		sleep 0.011
+	done
+}
+
+init_and_check()
+{
+	sysfs=`mount -t cgroup2 | head -1 | awk '{ print $3 }'`
+	if [ ! -d "$sysfs" ]; then
+		echo "Skipping: cgroup2 is not mounted" >&2
+		exit $ksft_skip
+	fi
+
+	if ! echo +$subsys_ctrl >$sysfs/cgroup.subtree_control ; then
+		echo "Skipping: cannot enable $subsys_ctrl in $sysfs" >&2
+		exit $ksft_skip
+	fi
+
+	if ! echo -$subsys_ctrl >$sysfs/cgroup.subtree_control ; then
+		echo "Skipping: cannot disable $subsys_ctrl in $sysfs" >&2
+		exit $ksft_skip
+	fi
+}
+
+declare -a stresses
+declare -a stress_pids
+duration=5
+rc=0
+subsys_ctrl=cpuset
+sysfs=
+
+while getopts c:d:hs: opt; do
+	case $opt in
+	c)
+		subsys_ctrl=$OPTARG
+		;;
+	d)
+		duration=$OPTARG
+		;;
+	h)
+		echo "Usage $0 [ -s stress ] ... [ -d duration ] [-c controller] cmd args .."
+		echo -e "\t default duration $duration seconds"
+		echo -e "\t default controller $subsys_ctrl"
+		exit
+		;;
+	s)
+		func=stress_$OPTARG
+		if [ "x$(type -t $func)" != "xfunction" ] ; then
+			echo "Unknown stress $OPTARG"
+			exit 1
+		fi
+		stresses+=($func)
+		;;
+	esac
+done
+shift $((OPTIND - 1))
+
+init_and_check
+
+for s in ${stresses[*]} ; do
+	$s &
+	stress_pids+=($!)
+done
+
+
+time=0
+start=$(date +%s)
+
+while [ $time -lt $duration ] ; do
+	$*
+	rc=$?
+	[ $rc -eq 0 ] || break
+	time=$(($(date +%s) - $start))
+done
+
+for pid in ${stress_pids[*]} ; do
+	kill -SIGTERM $pid
+	wait $pid
+done
+
+exit $rc
-- 
2.21.0
[PATCH 1/5] cgroup: Update comments about task exit path
We no longer take cgroup_mutex in cgroup_exit and the exiting tasks are
not moved to init_css_set; reflect that in several comments to prevent
confusion.

Signed-off-by: Michal Koutný
---
 kernel/cgroup/cgroup.c | 29 +
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 753afbca549f..1488bb732902 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -899,8 +899,7 @@ static void css_set_move_task(struct task_struct *task,
 		/*
 		 * We are synchronized through cgroup_threadgroup_rwsem
 		 * against PF_EXITING setting such that we can't race
-		 * against cgroup_exit() changing the css_set to
-		 * init_css_set and dropping the old one.
+		 * against cgroup_exit()/cgroup_free() dropping the css_set.
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
@@ -1430,9 +1429,8 @@ struct cgroup *task_cgroup_from_root(struct task_struct *task,
 				     struct cgroup_root *root)
 {
 	/*
-	 * No need to lock the task - since we hold cgroup_mutex the
-	 * task can't change groups, so the only thing that can happen
-	 * is that it exits and its css is set back to init_css_set.
+	 * No need to lock the task - since we hold css_set_lock the
+	 * task can't change groups.
 	 */
 	return cset_cgroup_from_root(task_css_set(task), root);
 }
@@ -6020,7 +6018,7 @@ void cgroup_post_fork(struct task_struct *child)
 	struct css_set *cset;
 
 	spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
+	cset = task_css_set(current); /* current is @child's parent */
 	if (list_empty(&child->cg_list)) {
 		get_css_set(cset);
 		cset->nr_tasks++;
@@ -6063,20 +6061,8 @@ void cgroup_post_fork(struct task_struct *child)
  * cgroup_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
- * Description: Detach cgroup from @tsk and release it.
- *
- * Note that cgroups marked notify_on_release force every task in
- * them to take the global cgroup_mutex mutex when exiting.
- * This could impact scaling on very large systems.  Be reluctant to
- * use notify_on_release cgroups where very high task exit scaling
- * is required on large systems.
+ * Description: Detach cgroup from @tsk.
  *
- * We set the exiting tasks cgroup to the root cgroup (top_cgroup).  We
- * call cgroup_exit() while the task is still competent to handle
- * notify_on_release(), then leave the task attached to the root cgroup in
- * each hierarchy for the remainder of its exit.  No need to bother with
- * init_css_set refcnting.  init_css_set never goes away and we can't race
- * with migration path - PF_EXITING is visible to migration path.
  */
 void cgroup_exit(struct task_struct *tsk)
 {
@@ -6086,7 +6072,8 @@ void cgroup_exit(struct task_struct *tsk)
 
 	/*
 	 * Unlink from @tsk from its css_set. As migration path can't race
-	 * with us, we can check css_set and cg_list without synchronization.
+	 * with us (thanks to cgroup_threadgroup_rwsem), we can check css_set
+	 * and cg_list without synchronization.
 	 */
 	cset = task_css_set(tsk);
 
@@ -6102,6 +6089,8 @@ void cgroup_exit(struct task_struct *tsk)
 		spin_unlock_irq(&css_set_lock);
 	} else {
+		/* Take reference to avoid freeing init_css_set in cgroup_free,
+		 * see cgroup_fork(). */
 		get_css_set(cset);
 	}
-- 
2.21.0
[PATCH 0/5] Optimize single thread migration
Hello.
The important part is patch 02, where the reasoning is. The rest is
mostly auxiliary and split out into separate commits for better
readability.

The patches are based on v5.3.

Michal

Michal Koutný (5):
  cgroup: Update comments about task exit path
  cgroup: Optimize single thread migration
  selftests: cgroup: Simplify task self migration
  selftests: cgroup: Add task migration tests
  selftests: cgroup: Run test_core under interfering stress

 kernel/cgroup/cgroup-internal.h               |   5 +-
 kernel/cgroup/cgroup-v1.c                     |   5 +-
 kernel/cgroup/cgroup.c                        |  68
 tools/testing/selftests/cgroup/Makefile       |   4 +-
 tools/testing/selftests/cgroup/cgroup_util.c  |  42 -
 tools/testing/selftests/cgroup/cgroup_util.h  |   6 +-
 tools/testing/selftests/cgroup/test_core.c    | 146 ++
 tools/testing/selftests/cgroup/test_freezer.c |   2 +-
 tools/testing/selftests/cgroup/test_stress.sh |   4 +
 tools/testing/selftests/cgroup/with_stress.sh | 101
 10 files changed, 342 insertions(+), 41 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_stress.sh
 create mode 100755 tools/testing/selftests/cgroup/with_stress.sh

-- 
2.21.0
[PATCH 2/5] cgroup: Optimize single thread migration
There are reports of users who use thread migrations between cgroups and
they report a performance drop after d59cfc09c32a ("sched, cgroup:
replace signal_struct->group_rwsem with a global percpu_rwsem"). The
effect is pronounced on machines with more CPUs.

The migration is affected by forking noise happening in the background:
after the mentioned commit, a migrating thread must wait for all
(forking) processes on the system, not only those of its threadgroup.

There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).

In the case of a self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in the case of single
thread migration, and the executing thread cannot exec(2) or exit(2)
while it is writing into cgroup.threads. In the case of do_exit because
of signal delivery, we either exit before the migration or finish the
migration (of a not yet PF_EXITING thread) and die afterwards.

This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.

This change improves the performance of migration-dependent workloads
similarly to the original per-signal_struct state.
Signed-off-by: Michal Koutný
---
 kernel/cgroup/cgroup-internal.h |  5 +++--
 kernel/cgroup/cgroup-v1.c       |  5 +++--
 kernel/cgroup/cgroup.c          | 39 +
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 809e34a3c017..90d1710fef6c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -231,9 +231,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
+					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task)
+void cgroup_procs_write_finish(struct task_struct *task, bool locked)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 88006be40ea3..dd0b263430d5 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -514,12 +514,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	struct task_struct *task;
 	const struct cred *cred, *tcred;
 	ssize_t ret;
+	bool locked;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup);
+	task = cgroup_procs_write_start(buf, threadgroup, &locked);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -541,7 +542,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 out_finish:
-	cgroup_procs_write_finish(task);
+	cgroup_procs_write_finish(task, locked);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1488bb732902..3f806dbe279b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2822,7 +2822,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	return ret;
 }
 
-struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
+					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem)
 {
 	struct task_struct *tsk;
@@ -2831,7 +2832,21 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	if (pid || threadgroup) {
+		percpu_down_write(&cgroup_threadgroup_rwsem);
+		*locked = true;
+	} else {
+		*locked = false;
+
Re: [PATCH 08/10] blkcg: implement blk-iocost
Hi
(and apologies for the relatively late reply).

On Tue, Sep 10, 2019 at 09:08:55AM -0700, Tejun Heo wrote:
> I can implement the switching if so.

I see, the "conflict" is solved by the switching.

> Initially, I put them under block device sysfs but it was too clumsy
> with different config file formats and all.

Do you have any more details on that? In the end, it all boils down to a
daemon/setup utility writing into the control files, and it can use
whatever config files it decides, can't it?

> I think it's better to have global controller configs at the root
> cgroup.

I agree with the "global controller" configs; however, does it also hold
for "global controller per-device" configs? They seem closer to the
device than to the controller. Potentially, the parameters could be used
by some other consumers in the future. (I'm not opposing the current
form, I just want to explore the space before an API is fixed.)

> Not at all. These are system-wide configs. cgroup namespaces
> shouldn't have anything which aren't in non-root cgroups.

Thanks, I understand the cgroup namespaces are not meant to be
transparent to their members.

Michal
Re: [PATCH RFC 00/14] The new slab memory controller
On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal wrote:
> kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> I am however not aware of any automation based on it.

My experience is the same. However, the point is that this has been
exposed for ages, so the safe assumption is that there may be users.

> Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> option to enable it?

I don't think a v2 file is necessary given the cost of obtaining the
information. But I concur with the idea of making the per-object
tracking switchable at boot time or at least CONFIGurable.

Michal
Re: [PATCH RFC 00/14] The new slab memory controller
On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin wrote:
> Roman Gushchin (14):
> [...]
>   mm: memcg/slab: use one set of kmem_caches for all memory cgroups

From that commit's message:

> 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
>    no per-memcg kmem_caches anymore (empty output is printed)

The empty file means no allocations took place in the particular cgroup.
I find this quite a surprising change for consumers of these stats. I
understand that obtaining the same data efficiently from the proposed
structures is difficult; however, such a change should be avoided.

(In my understanding, obsoleted file ~ not available in v2; however, it
should not disappear from v1.)

Michal
Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
Hi.

On Thu, Sep 26, 2019 at 05:55:29PM -0700, Mina Almasry wrote:
> My guess is that a new controller needs to support cgroups-v2, which
> is fine. But can a new controller also support v1? Or is there a
> requirement that new controllers support *only* v2? I need whatever
> solution here to work on v1.

Here is my view of the important criteria:
1) stability, v1 APIs and semantics should not be changed,
2) futureproofness, v1 uses should be convertible to v2 uses,
3) maintainability, the less (similar) code the better.

And here is my braindump of some approaches:

A) new controller, v2 only
   - 1) ok
   - 2) may be ok
   - 3) separate v1 and v2 implementations
   - exclusion must be ensured on hybrid hierarchies

B) new controller, version oblivious (see e.g. pid)
   - 1) sort of ok
   - 2) partially ok
   - 3) two v1 implementations
   - exclusion must be ensured even on pure v1 hierarchies

C) extending the existing controller, w/out v2 counterpart
   - 1) ok with workarounds (new option switching behavior)
   - 2) not ok
   - 3) likely OK

D) extending the existing controller, with v2 counterpart
   - 1) ok with workarounds (new option switching behavior, see cpuset)
   - 2) may be ok
   - 3) likely OK

AFAIU, the current patchset is a variation of C). Personally, I think
something like D) could work; I'm not convinced about A) and B) based on
the breakdown above. But it may induce some other ideas.

My two cents,
Michal
Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller
On Mon, Sep 02, 2019 at 04:02:57PM -0700, Suren Baghdasaryan wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > +				    enum uclamp_id clamp_id)
> [...]
> > +	rcu_read_lock();
> > +	tg = css_tg(seq_css(sf));
> > +	util_clamp = tg->uclamp_req[clamp_id].value;
> > +	rcu_read_unlock();
> > +
> > +	if (util_clamp == SCHED_CAPACITY_SCALE) {
> > +		seq_puts(sf, "max\n");
> > +		return;
> > +	}
> > +
> > +	percent = tg->uclamp_pct[clamp_id];
>
> You are taking RCU lock when accessing tg->uclamp_req but not when
> accessing tg->uclamp_pct.

Good point.

> Is that intentional? Can tg be destroyed under you?

Actually, the rcu_read_{,un}lock should be unnecessary in the context of
the kernfs file op handler -- the tg/css won't go away as long as its
kernfs file is being worked with.
Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits
(+CC cgro...@vger.kernel.org)

On Thu, Aug 08, 2019 at 12:40:02PM -0700, Mina Almasry wrote:
> We have developers interested in using hugetlb_cgroups, and they have
> expressed dissatisfaction regarding this behavior.

I assume you still want to enforce a limit on a particular group and the
application must be able to handle resource scarcity (but better
notified than by SIGBUS).

> Alternatives considered:
> [...]

(I did not try that but) have you considered:
3) MAP_POPULATE while you're making the reservation,
4) using multiple hugetlbfs mounts with respective limits.

> Caveats:
> 1. This support is implemented for cgroups-v1. I have not tried
>    hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
>    This is largely because we use cgroups-v1 for now.

Adding something new into v1 without a v2 counterpart makes migration
harder; that's one of the reasons why the v1 API is rather frozen now.
(I'm not sure whether the current hugetlb controller fits into v2 at all
though.)

Michal
Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller
On Thu, Aug 08, 2019 at 04:10:21PM +0100, Patrick Bellasi wrote:
> Not sure to get what you mean here: I'm currently exposing uclamp to
> both v1 and v2 hierarchies.

The cpu controller has different APIs for the v1 and v2 hierarchies. My
question, reworded, is -- are the new knobs exposed in the legacy API
intentionally/for a reason?

Michal
Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps
On Thu, Aug 08, 2019 at 04:08:10PM +0100, Patrick Bellasi wrote:
> Well, if I've got correctly your comment in the previous message, I
> would say that at this stage we don't need RCU looks at all.

Agreed.

> Reason being that cpu_util_update_eff() gets called only from
> cpu_uclamp_write() which is from an ongoing write operation on a cgroup
> attribute and thus granted to be available.
>
> We will eventually need to move the RCU look only down the stack when
> uclamp_update_active_tasks() gets called to update the RUNNABLE tasks on
> a RQ... or perhaps we don't need them since we already get the
> task_rq_lock() for each task we visit.

Unless you remove css_for_each_descendant_pre() in
cpu_util_update_eff(), the rcu_read_lock() cannot go below it. (You'd be
RCU-accessing other csses that aren't pinned in the write.)

Michal
Re: [PATCH v13 0/6] Add utilization clamping support (CGroups API)
On Fri, Aug 02, 2019 at 10:08:47AM +0100, Patrick Bellasi wrote:
> Patrick Bellasi (6):
>   sched/core: uclamp: Extend CPU's cgroup controller
>   sched/core: uclamp: Propagate parent clamps
>   sched/core: uclamp: Propagate system defaults to root group
>   sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
>   sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>   sched/core: uclamp: always use enum uclamp_id for clamp_id values

Thank you, Patrick, for your patience. I used the time to revisit the
series once again and I think the RCU locks can be streamlined a bit. If
you find that correct, feel free to add my Reviewed-by to the updated
series (regarding 1/6 and the legacy hierarchy, I'm just asking).

Michal
Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps
On Fri, Aug 02, 2019 at 10:08:49AM +0100, Patrick Bellasi wrote:
> @@ -7095,6 +7149,7 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> 	if (req.ret)
> 		return req.ret;
> 
> +	mutex_lock(&uclamp_mutex);
> 	rcu_read_lock();
> 
> 	tg = css_tg(of_css(of));
> @@ -7107,7 +7162,11 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> 	 */
> 	tg->uclamp_pct[clamp_id] = req.percent;
> 
> +	/* Update effective clamps to track the most restrictive value */
> +	cpu_util_update_eff(of_css(of));
> +
> 	rcu_read_unlock();
> +	mutex_unlock(&uclamp_mutex);

Following my remarks to "[PATCH v13 1/6] sched/core: uclamp: Extend
CPU's cgroup controller", I wonder if the rcu_read_lock() couldn't be
moved right before cpu_util_update_eff(). And by extension,
rcu_read_(un)lock could be hidden inside cpu_util_update_eff(), closer
to its actual need.
Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller
On Fri, Aug 02, 2019 at 10:08:48AM +0100, Patrick Bellasi wrote:
> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> +				size_t nbytes, loff_t off,
> +				enum uclamp_id clamp_id)
> +{
> +	struct uclamp_request req;
> +	struct task_group *tg;
> +
> +	req = capacity_from_percent(buf);
> +	if (req.ret)
> +		return req.ret;
> +
> +	rcu_read_lock();

This should be the uclamp_mutex. (The compound result of the series is
correct as the lock is introduced in "sched/core: uclamp: Propagate
parent clamps". This is just for the happiness of
cherry-pickers/bisectors.)

> +static inline void cpu_uclamp_print(struct seq_file *sf,
> +				    enum uclamp_id clamp_id)
> +{
> [...]
> +	rcu_read_lock();
> +	tg = css_tg(seq_css(sf));
> +	util_clamp = tg->uclamp_req[clamp_id].value;
> +	rcu_read_unlock();

Why is the rcu_read_lock() needed here? (I'm considering the comment in
of_css() that should apply here, and it seems that similar uses in other
seq_file handlers also skip this.)

> @@ -7369,6 +7506,20 @@ static struct cftype cpu_legacy_files[] = {
> [...]
> +		.name = "uclamp.min",
> [...]
> +		.name = "uclamp.max",

I don't see technical reasons why uclamp couldn't work on the legacy
hierarchy, and Tejun acked the series; despite that I'll ask -- should
the new attributes be exposed in the v1 controller hierarchy (taking
into account that the legacy API is sort of frozen and the potential
maintenance needs spanning both hierarchies)?
Re: [PATCH v12 3/6] sched/core: uclamp: Propagate system defaults to root group
On Thu, Jul 18, 2019 at 07:17:45PM +0100, Patrick Bellasi wrote:
> The clamp values are not tunable at the level of the root task group.
> That's for two main reasons:
>
> - the root group represents "system resources" which are always
>   entirely available from the cgroup standpoint.
>
> - when tuning/restricting "system resources" makes sense, tuning must
>   be done using a system wide API which should also be available when
>   control groups are not.
>
> When a system wide restriction is available, cgroups should be aware of
> its value in order to know exactly how much "system resources" are
> available for the subgroups.

IIUC, the global default would apply in uclamp_eff_get(), so this
propagation isn't strictly necessary in order to apply to tasks (that's
how it works under !CONFIG_UCLAMP_TASK_GROUP). The reason is that the
effective value (which isn't exposed currently) in a group takes this
global restriction into account, right?

> @@ -1043,12 +1063,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> [...]
> +	if (update_root_tg)
> +		uclamp_update_root_tg();
> +
> 	/*
> 	 * Updating all the RUNNABLE task is expensive, keep it simple and do
> 	 * just a lazy update at each next enqueue time.

Since uclamp_update_root_tg() traverses down to
uclamp_update_active_tasks(), is this comment only half true now?
Re: [PATCH v12 1/6] sched/core: uclamp: Extend CPU's cgroup controller
On Thu, Jul 18, 2019 at 07:17:43PM +0100, Patrick Bellasi wrote:
> +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> +                                    char *buf, size_t nbytes,
> +                                    loff_t off)
> +{
> [...]
> +static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
> +                                    char *buf, size_t nbytes,
> +                                    loff_t off)
> +{
> [...]

These two functions are almost identical yet not trivial. I think it would be better to have the code in one place only and distinguish the two cases by a passed clamp_id.
Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
On Tue, Jul 16, 2019 at 10:41:36AM +0800, 王贇 wrote:
> Actually whatever the memory node sets or cpu allow sets is, it will
> take effect on task's behavior regarding memory location and cpu
> location, while the locality only care about the results rather than
> the sets.

My previous response missed much of the context, so it was a bit off. I see what you mean by the locality now. Alas, I can't assess whether it's the right thing to do for the NUMA behavior you are trying to optimize (i.e. you need an answer from someone more familiar with NUMA balancing).

Michal
Re: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇 wrote:
> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see new
> output line heading with 'exectime', like:
>
>   exectime 311900 407166

What you present are times aggregated over the CPUs in each NUMA node, which seems a rather lossy interface. Even if the aggregated information is sufficient for your monitoring, I think it's worth providing the information at its original granularity. Note that the cpuacct v1 controller used to report such percpu runtime stats. A v2 implementation would rather build upon the rstat API.

Michal
Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
On Tue, Jul 16, 2019 at 03:34:35PM +0100, Patrick Bellasi wrote:
> Am I missing something?

No, it's rather my misinterpretation of the syscall semantics.

> Otherwise, I think the changelog sentence you quoted is just
> misleading.

It certainly misled me into thinking of the sched_setattr() calls as requests for utilization lying within the given interval (substituting 0 or 1 when only one boundary is given, further constrained by the tg's interval). I see your point now: those are actually two (mostly) independent controls. Makes sense.

Thanks,
Michal
Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
On Tue, Jul 16, 2019 at 03:34:17PM +0100, Patrick Bellasi wrote:
> > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > this should be protected with rcu_read_lock().
>
> Right, good catch! Will add in v12.

When I responded to your other patch, it occurred to me that since cpu_util_update_eff() goes writing down to child csses, it should also take uclamp_mutex here, to avoid a race with direct cgroup attribute writers.