Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Fenghua,

On Tue, Apr 13 2021 at 23:40, Fenghua Yu wrote:
> On Mon, Apr 12, 2021 at 09:15:08AM +0200, Thomas Gleixner wrote:
>> Aside of that why are you trying to make this throttling in any way
>> accurate? It does not matter at all, really. Limit reached, put it to
>> sleep for some time and be done with it. No point in trying to be clever
>> for no value.
>
> Is it OK to set bld_ratelimit between 1 and 1,000 bus locks/sec for
> bld_ratelimit?
>
> Can I do the throttling like this?
>
>	/* Enforce no more than bld_ratelimit bus locks/sec. */
>	while (!__ratelimit(&bld_ratelimit))
>		msleep(10);
>
> On one machine, if bld_ratelimit=1,000, that's about 5msec for a busy
> bus lock loop, i.e. bus is locked for about 5msec and then the process
> sleeps for 10msec and thus won't generate any bus lock.
> "dd" command running on other cores doesn't have noticeable degradation
> with bld_ratelimit=1,000.

Something like this makes sense. Add some rationale for the upper limit
you finally end up with so sysadmins can make informed decisions.

Thanks,

        tglx
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Hi, Thomas,

On Mon, Apr 12, 2021 at 09:15:08AM +0200, Thomas Gleixner wrote:
> On Sat, Apr 03 2021 at 01:04, Fenghua Yu wrote:
> > On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote:
> >> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote:
> >> And even with throttling the injection rate further down to 25k per
> >> second the impact on the workload is still significant in the 10% range.
> >
> > So I can change the ratelimit to system wide and call usleep_range()
> > to sleep:
> >
> >	while (!__ratelimit(&bld_ratelimit))
> >		usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit);
> >
> > The max bld_ratelimit is 1,000,000/s because the max sleeping time is 1
> > usec.
>
> Maximum sleep time is 1usec?
>
> > The min bld_ratelimit is 1/s.
>
> Again. This does not make sense at all. 1Mio bus lock events per second
> are way beyond the point where the machine does anything else than being
> stuck in buslocks.
>
> Aside of that why are you trying to make this throttling in any way
> accurate? It does not matter at all, really. Limit reached, put it to
> sleep for some time and be done with it. No point in trying to be clever
> for no value.

Is it OK to set bld_ratelimit between 1 and 1,000 bus locks/sec for
bld_ratelimit?

Can I do the throttling like this?

	/* Enforce no more than bld_ratelimit bus locks/sec. */
	while (!__ratelimit(&bld_ratelimit))
		msleep(10);

On one machine, if bld_ratelimit=1,000, that's about 5msec for a busy
bus lock loop, i.e. bus is locked for about 5msec and then the process
sleeps for 10msec and thus won't generate any bus lock.
"dd" command running on other cores doesn't have noticeable degradation
with bld_ratelimit=1,000.

Thanks.

-Fenghua
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
On Sat, Apr 03 2021 at 01:04, Fenghua Yu wrote:
> On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote:
>> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote:
>> And even with throttling the injection rate further down to 25k per
>> second the impact on the workload is still significant in the 10% range.
>
> Thank you for your insight!
>
> So I can change the ratelimit to system wide and call usleep_range()
> to sleep:
>
>	while (!__ratelimit(&bld_ratelimit))
>		usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit);
>
> The max bld_ratelimit is 1,000,000/s because the max sleeping time is 1
> usec.

Maximum sleep time is 1usec?

> The min bld_ratelimit is 1/s.

Again. This does not make sense at all. 1Mio bus lock events per second
are way beyond the point where the machine does anything else than being
stuck in buslocks.

Aside of that why are you trying to make this throttling in any way
accurate? It does not matter at all, really. Limit reached, put it to
sleep for some time and be done with it. No point in trying to be clever
for no value.

Thanks,

        tglx
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Hi, Thomas,

On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote:
> > On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote:
> >> > +	if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) {
> >> > +		bld_ratelimit = ratelimit;
> >>
> >> So any rate up to INTMAX/s is valid here, right?
> >
> > Yes. I don't see smaller limitation than INTMAX/s. Is that right?
>
> That's a given, but what's the point of limits in that range?
>
> A buslock access locks up the system for X cycles. So the total amount
> of allowable damage in cycles per second is:
>
>	limit * stall_cycles_per_bus_lock
>
> ergo the time (in seconds) which the system is locked up is:
>
>	limit * stall_cycles_per_bus_lock / cpufreq
>
> Which means for ~INTMAX/2 on a 2 GHz CPU:
>
>	2 * 10^9 * $CYCLES / 2 * 10^9 = $CYCLES seconds
>
> Assumed the inflicted damage is only 1 cycle then #LOCK is pretty much
> permanently on if there are enough threads. Sure #DB will slow them
> down, but it still does not make any sense at all especially as the
> damage is certainly greater than a single cycle.
>
> And because the changelogs and the docs are void of numbers I just got
> real numbers myself.
>
> With a single thread doing a 'lock inc *mem' across a cache line
> boundary the workload which I measured with perf stat goes from:
>
>	5,940,985,091	instructions	# 0.88 insn per cycle
>
>	2.780950806	seconds time elapsed
>	0.99848		seconds user
>	4.202137000	seconds sys
>
> to
>
>	7,467,979,504	instructions	# 0.10 insn per cycle
>
>	5.110795917	seconds time elapsed
>	7.123499000	seconds user
>	37.266852000	seconds sys
>
> The buslock injection rate is ~250k per second.
>
> Even if I ratelimit the locked inc by a delay loop of ~5000 cycles
> which is probably more than what the #DB will cost then this single task
> still impacts the workload significantly:
>
>	6,496,994,537	instructions	# 0.39 insn per cycle
>
>	3.043275473	seconds time elapsed
>	1.899852000	seconds user
>	8.957088000	seconds sys
>
> The buslock injection rate is down to ~150k per second in this case.
>
> And even with throttling the injection rate further down to 25k per
> second the impact on the workload is still significant in the 10% range.

Thank you for your insight!

So I can change the ratelimit to system wide and call usleep_range()
to sleep:

	while (!__ratelimit(&bld_ratelimit))
		usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit);

The max bld_ratelimit is 1,000,000/s because the max sleeping time is 1
usec. The min bld_ratelimit is 1/s.

> And of course the documentation of the ratelimit parameter explains all
> of this in great detail so the administrator has a trivial job to tune
> that, right?

I will explain how to tune the parameter in the buslock.rst doc.

> >> > +	case sld_ratelimit:
> >> > +		/* Enforce no more than bld_ratelimit bus locks/sec. */
> >> > +		while (!__ratelimit(&get_current_user()->bld_ratelimit))
> >> > +			msleep(1000 / bld_ratelimit);
>
> For any ratelimit > 1000 this will loop up to 1000 times with
> CONFIG_HZ=1000.
>
> Assume that the buslock producer has tons of threads which all end up
> here pretty soon then you launch a mass wakeup in the worst case every
> jiffy. Are you sure that the cure is better than the disease?

If using usleep_range() to sleep, the threads will not sleep and wake up
every jiffy, right? Maybe I can use msleep() for msec sleeps (bigger
bld_ratelimit) and usleep_range() for usec sleeps (smaller
bld_ratelimit)?

Even if there is a mass wakeup, throttling the threads can avoid the
system wide performance degradation (e.g. the 7x slower dd command in
another user). Is that a good justification for throttling the threads?

> > If I split this whole patch set into two patch sets:
> > 1. Three patches in the first patch set: the enumeration patch, the warn
> >    and fatal patch, and the documentation patch.
> > 2. Two patches in the second patch set: the ratelimit patch and the
> >    documentation patch.
> >
> > Then I will send the two patch sets separately, you will accept them one
> > by one. Is that OK?
>
> That's obviously the right thing to do because #1 should be ready and we
> can sort out #2 separately. See the conversation with Tony.

Thank you for picking up the first patch set!

-Fenghua
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Hi, Thomas,

On Sat, Mar 20, 2021 at 02:57:52PM +0100, Thomas Gleixner wrote:
> On Sat, Mar 20 2021 at 02:01, Thomas Gleixner wrote:
> > On Fri, Mar 19 2021 at 21:50, Tony Luck wrote:
> >>> What is the justification for making this rate limit per UID and not
> >>> per task, per process or systemwide?
> >>
> >> The concern is that a malicious user is running a workload that loops
> >> obtaining the buslock. This brings the whole system to its knees.
> >>
> >> Limiting per task doesn't help. The user can just fork(2) a whole bunch
> >> of tasks for a distributed buslock attack.
> >
> > Fair enough.
> >
> >> Systemwide might be an interesting alternative. Downside would be
> >> accidental rate limit of non-malicious tasks that happen to grab a
> >> bus lock periodically but in the same window with other buslocks
> >> from other users.
> >>
> >> Do you think that's a risk worth taking to make the code simpler?
> >
> > I'd consider it low risk, but I just looked for the usage of the
> > existing ratelimit in struct user and the related commit. Now it dawns
> > on me where you are coming from.
>
> So after getting real numbers myself, I have more thoughts on
> this. Setting a reasonable per user limit might be hard when you want to
> protect e.g. against an orchestrated effort by several users
> (containers...). If each of them stays under the limit which is easy
> enough to figure out then you still end up with significant accumulated
> damage.
>
> So systemwide might not be the worst option after all.

Indeed.

> The question is how wide spread are bus locks in existing applications?
> I haven't found any on a dozen machines with random variants of
> workloads so far according to perf ... -e sq_misc.split_lock.

We have been running various tests widely inside Intel (and also
outside) after enabling split lock and captured a few split lock issues
in firmware, kernel, drivers, and apps.

As you know, we have submitted a few patches to fix the split lock
issues in the kernel and drivers (e.g. split lock in bit ops) and fixed
a few split lock issues in firmware. But so far I'm not aware of any
split lock issues in user space yet. I guess compilers do a good job of
cache line alignment to avoid this issue. But inline asm in user apps
can easily hit this issue (on purpose).

> What's the actual scenario in the real world where a buslock access
> might be legitimate?

I did a simple experiment: looping on a split locked instruction on one
core in one user can slow down a "dd" command running on another core
in another user by 7 times. A malicious user can do similar things to
slow down the whole system performance, right?

> And what's the advice, recommendation for a system administrator how to
> analyze the situation and what kind of parameter to set?
>
> I tried to get answers from Documentation/x86/buslock.rst, but

Can I change the sleep code in handle_bus_lock() to the following?

	while (!__ratelimit(&bld_ratelimit))
		usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit);

Maybe the system wide bus lock ratelimit can be set to the default value
1,000,000/s, which is also the max ratelimit value. The shortest sleep
in the kernel is 1 usec, which means the max bld_ratelimit can be up to
1,000,000.

If the system administrator considers bus locks less tolerable and wants
to throttle them further, bld_ratelimit can be set to a smaller number.
The smallest bld_ratelimit is 1.

When I gradually decrease the bld_ratelimit value, I can see fewer bus
locks issued per second systemwide and the "dd" command or other memory
benchmarks are less impacted by the bus locks.

If this works, I will have the buslock.rst doc explain the situation
and how to set the parameter.

Thanks.

-Fenghua
RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
On Sat, Mar 20 2021 at 02:01, Thomas Gleixner wrote:
> On Fri, Mar 19 2021 at 21:50, Tony Luck wrote:
>>> What is the justification for making this rate limit per UID and not
>>> per task, per process or systemwide?
>>
>> The concern is that a malicious user is running a workload that loops
>> obtaining the buslock. This brings the whole system to its knees.
>>
>> Limiting per task doesn't help. The user can just fork(2) a whole bunch
>> of tasks for a distributed buslock attack.
>
> Fair enough.
>
>> Systemwide might be an interesting alternative. Downside would be accidental
>> rate limit of non-malicious tasks that happen to grab a bus lock periodically
>> but in the same window with other buslocks from other users.
>>
>> Do you think that's a risk worth taking to make the code simpler?
>
> I'd consider it low risk, but I just looked for the usage of the
> existing ratelimit in struct user and the related commit. Now it dawns
> on me where you are coming from.

So after getting real numbers myself, I have more thoughts on
this. Setting a reasonable per user limit might be hard when you want to
protect e.g. against an orchestrated effort by several users
(containers...). If each of them stays under the limit which is easy
enough to figure out then you still end up with significant accumulated
damage.

So systemwide might not be the worst option after all.

The question is how wide spread are bus locks in existing applications?
I haven't found any on a dozen machines with random variants of
workloads so far according to perf ... -e sq_misc.split_lock.

What's the actual scenario in the real world where a buslock access
might be legitimate?

And what's the advice, recommendation for a system administrator how to
analyze the situation and what kind of parameter to set?

I tried to get answers from Documentation/x86/buslock.rst, but

Thanks,

        tglx
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote:
> On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote:
>> > +	if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) {
>> > +		bld_ratelimit = ratelimit;
>>
>> So any rate up to INTMAX/s is valid here, right?
>
> Yes. I don't see smaller limitation than INTMAX/s. Is that right?

That's a given, but what's the point of limits in that range?

A buslock access locks up the system for X cycles. So the total amount
of allowable damage in cycles per second is:

	limit * stall_cycles_per_bus_lock

ergo the time (in seconds) which the system is locked up is:

	limit * stall_cycles_per_bus_lock / cpufreq

Which means for ~INTMAX/2 on a 2 GHz CPU:

	2 * 10^9 * $CYCLES / 2 * 10^9 = $CYCLES seconds

Assumed the inflicted damage is only 1 cycle then #LOCK is pretty much
permanently on if there are enough threads. Sure #DB will slow them
down, but it still does not make any sense at all especially as the
damage is certainly greater than a single cycle.

And because the changelogs and the docs are void of numbers I just got
real numbers myself.

With a single thread doing a 'lock inc *mem' across a cache line
boundary the workload which I measured with perf stat goes from:

	5,940,985,091	instructions	# 0.88 insn per cycle

	2.780950806	seconds time elapsed
	0.99848		seconds user
	4.202137000	seconds sys

to

	7,467,979,504	instructions	# 0.10 insn per cycle

	5.110795917	seconds time elapsed
	7.123499000	seconds user
	37.266852000	seconds sys

The buslock injection rate is ~250k per second.

Even if I ratelimit the locked inc by a delay loop of ~5000 cycles
which is probably more than what the #DB will cost then this single task
still impacts the workload significantly:

	6,496,994,537	instructions	# 0.39 insn per cycle

	3.043275473	seconds time elapsed
	1.899852000	seconds user
	8.957088000	seconds sys

The buslock injection rate is down to ~150k per second in this case.

And even with throttling the injection rate further down to 25k per
second the impact on the workload is still significant in the 10% range.

And of course the documentation of the ratelimit parameter explains all
of this in great detail so the administrator has a trivial job to tune
that, right?

>> > +	case sld_ratelimit:
>> > +		/* Enforce no more than bld_ratelimit bus locks/sec. */
>> > +		while (!__ratelimit(&get_current_user()->bld_ratelimit))
>> > +			msleep(1000 / bld_ratelimit);

For any ratelimit > 1000 this will loop up to 1000 times with
CONFIG_HZ=1000.

Assume that the buslock producer has tons of threads which all end up
here pretty soon then you launch a mass wakeup in the worst case every
jiffy. Are you sure that the cure is better than the disease?

> If I split this whole patch set into two patch sets:
> 1. Three patches in the first patch set: the enumeration patch, the warn
>    and fatal patch, and the documentation patch.
> 2. Two patches in the second patch set: the ratelimit patch and the
>    documentation patch.
>
> Then I will send the two patch sets separately, you will accept them one
> by one. Is that OK?

That's obviously the right thing to do because #1 should be ready and we
can sort out #2 separately. See the conversation with Tony.

Thanks,

        tglx
RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
On Fri, Mar 19 2021 at 21:50, Tony Luck wrote:
>> What is the justification for making this rate limit per UID and not
>> per task, per process or systemwide?
>
> The concern is that a malicious user is running a workload that loops
> obtaining the buslock. This brings the whole system to its knees.
>
> Limiting per task doesn't help. The user can just fork(2) a whole bunch
> of tasks for a distributed buslock attack.

Fair enough.

> Systemwide might be an interesting alternative. Downside would be accidental
> rate limit of non-malicious tasks that happen to grab a bus lock periodically
> but in the same window with other buslocks from other users.
>
> Do you think that's a risk worth taking to make the code simpler?

I'd consider it low risk, but I just looked for the usage of the
existing ratelimit in struct user and the related commit. Now it dawns
on me where you are coming from.

So that seems to become a pattern ... so the uncompiled thing below
might solve that. Yes, it makes the efivars thingy slower, but do we
care? We neither care about efivars performance nor about the buslock
performance. But I pretty much care about not having to stare at code
which gets the fundamental refcounting wrong.

Thanks,

        tglx
---
 fs/efivarfs/Kconfig           |    1 +
 fs/efivarfs/file.c            |   11 +++++--
 include/linux/ratelimit.h     |    1 +
 include/linux/ratelimit_uid.h |   26 ++++++++++++++
 include/linux/sched/user.h    |    4 ++--
 kernel/user.c                 |    7 ++--
 lib/Kconfig                   |    3 +++
 lib/ratelimit.c               |   41 +++++++++++++++++++++
 8 files changed, 87 insertions(+), 7 deletions(-)

--- a/fs/efivarfs/Kconfig
+++ b/fs/efivarfs/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config EFIVAR_FS
 	tristate "EFI Variable filesystem"
+	select UID_RATELIMIT
 	depends on EFI
 	default m
 	help
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -63,6 +63,14 @@ static ssize_t efivarfs_file_write(struc
 	return bytes;
 }
 
+static const struct uid_ratelimit_cfg efivars_rl = {
+	.which		= UID_RATELIMIT_EFIVARS,
+	.interval	= HZ,
+	.burst		= 100,
+	.flags		= RATELIMIT_MSG_ON_RELEASE,
+	.delay		= 50,
+};
+
 static ssize_t efivarfs_file_read(struct file *file, char __user *userbuf,
 		size_t count, loff_t *ppos)
 {
@@ -73,8 +81,7 @@ static ssize_t efivarfs_file_read(struct
 	ssize_t size = 0;
 	int err;
 
-	while (!__ratelimit(&file->f_cred->user->ratelimit))
-		msleep(50);
+	uid_ratelimit(&efivars_rl);
 
 	err = efivar_entry_size(var, &datasize);
--- a/include/linux/ratelimit.h
+++ b/include/linux/ratelimit.h
@@ -3,6 +3,7 @@
 #define _LINUX_RATELIMIT_H
 
 #include <linux/ratelimit_types.h>
+#include <linux/ratelimit_uid.h>
 #include <linux/sched.h>
 #include <linux/spinlock.h>
--- /dev/null
+++ b/include/linux/ratelimit_uid.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RATELIMIT_UID_H
+#define _LINUX_RATELIMIT_UID_H
+
+/* Per UID ratelimits */
+enum uid_ratelimits {
+#ifdef CONFIG_EFIVAR_FS
+	UID_RATELIMIT_EFIVARS,
+#endif
+	UID_RATELIMIT_MAX,
+};
+
+#define UID_RATELIMIT_NODELAY	ULONG_MAX
+
+struct uid_ratelimit_cfg {
+	enum uid_ratelimits	which;
+	int			interval;
+	int			burst;
+	unsigned long		flags;
+	unsigned long		delay;
+};
+
+extern int __uid_ratelimit(const struct uid_ratelimit_cfg *cfg, const void *func);
+#define uid_ratelimit(cfg)	__uid_ratelimit(cfg, __func__)
+
+#endif
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,8 +40,8 @@ struct user_struct {
 	atomic_t nr_watches;	/* The number of watches this user currently has */
 #endif
-	/* Miscellaneous per-user rate limit */
-	struct ratelimit_state ratelimit;
+	/* Miscellaneous per-user rate limits storage */
+	struct ratelimit_state *ratelimits[UID_RATELIMIT_MAX];
 };
 
 extern int uids_sysfs_init(void);
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -102,7 +102,6 @@ struct user_struct root_user = {
 	.sigpending	= ATOMIC_INIT(0),
 	.locked_shm	= 0,
 	.uid		= GLOBAL_ROOT_UID,
-	.ratelimit	= RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
 };
 
 /*
@@ -139,8 +138,12 @@ static struct user_struct *uid_hash_find
 static void free_user(struct user_struct *up, unsigned long flags)
 	__releases(&uidhash_lock)
 {
+	unsigned int i;
+
 	uid_hash_remove(up);
 	spin_unlock_irqrestore(&uidhash_lock, flags);
+	for (i = 0; i < UID_RATELIMIT_MAX; i++)
+		kfree(up->ratelimits[i]);
 	kmem_cache_free(uid_cachep, up);
 }
 
@@ -188,8 +191,6 @@ struct user_struct *alloc_uid(kuid_t uid
 	new->uid = uid;
 	refcount_set(&new->__count, 1);
-	ratelimit_state_init(&new->ratelimit, HZ, 100);
-
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Hi, Thomas,

On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote:
> On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote:
> > Change Log:
> > v5:
> > Address all comments from Thomas:
> > - Merge patch 2 and patch 3 into one patch so all "split_lock_detect="
> >   options are processed in one patch.
>
> What? I certainly did not request that. I said:
>
>   "Why is this separate and an all in one thing? patch 2/4 changes the
>    parameter first and 3/4 adds a new option"
>
> which means we want documentation for patch 2 and documentation for
> patch 3?
>
> The ratelimit thing is clearly an extra functionality on top of that
> buslock muck.
>
> Next time I write it out..

Sorry for misunderstanding your comments. I will split the document
patch into two: one for patch 2 (warn and fatal) and one for patch 3
(ratelimit).

> > +	if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) {
> > +		bld_ratelimit = ratelimit;
>
> So any rate up to INTMAX/s is valid here, right?

Yes. I don't see smaller limitation than INTMAX/s. Is that right?

> > +	case sld_ratelimit:
> > +		/* Enforce no more than bld_ratelimit bus locks/sec. */
> > +		while (!__ratelimit(&get_current_user()->bld_ratelimit))
> > +			msleep(1000 / bld_ratelimit);
>
> which is cute because msleep() will always sleep until the next jiffie
> increment happens.
>
> What's not so cute here is the fact that get_current_user() takes a
> reference on current's UID on every invocation, but nothing ever calls
> free_uid(). I missed that last time over the way more obvious HZ
> division.

I will call free_uid().

> > +++ b/kernel/user.c
> > @@ -103,6 +103,9 @@ struct user_struct root_user = {
> >  	.locked_shm	= 0,
> >  	.uid		= GLOBAL_ROOT_UID,
> >  	.ratelimit	= RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
> > +#ifdef CONFIG_CPU_SUP_INTEL
> > +	.bld_ratelimit	= RATELIMIT_STATE_INIT(root_user.bld_ratelimit, 0, 0),
> > +#endif
> >  };
> >
> >  /*
> > @@ -172,6 +175,11 @@ void free_uid(struct user_struct *up)
> >  	free_user(up, flags);
> >  }
> >
> > +#ifdef CONFIG_CPU_SUP_INTEL
> > +/* Some Intel CPUs may set this for rate-limited bus locks. */
> > +int bld_ratelimit;
> > +#endif
>
> Of course this variable is still required to be in the core kernel code
> because?
>
> While you decided to munge this all together, you obviously ignored the
> following review comment:
>
>   "It also lacks the information that the ratelimiting is per UID
>    and not per task and why this was chosen to be per UID..."
>
> There is still no reasoning neither in the changelog nor in the cover
> letter nor in a reply to my review.
>
> So let me repeat my question and make it more explicit:
>
> What is the justification for making this rate limit per UID and not
> per task, per process or systemwide?

Tony just now answered the justification. If that's OK, I will add the
answer in the commit message.

> > struct user_struct *alloc_uid(kuid_t uid)
> > {
> > 	struct hlist_head *hashent = uidhashentry(uid);
> > @@ -190,6 +198,11 @@ struct user_struct *alloc_uid(kuid_t uid)
> >  	refcount_set(&new->__count, 1);
> >  	ratelimit_state_init(&new->ratelimit, HZ, 100);
> >  	ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE);
> > +#ifdef CONFIG_CPU_SUP_INTEL
> > +	ratelimit_state_init(&new->bld_ratelimit, HZ, bld_ratelimit);
> > +	ratelimit_set_flags(&new->bld_ratelimit,
> > +			    RATELIMIT_MSG_ON_RELEASE);
> > +#endif
>
> If this has a proper justification for being per user and having to add
> 40 bytes per UID for something which is mostly unused then there are
> definitely better ways to do that than slapping #ifdefs into
> architecture agnostic core code.
>
> So if you instead of munging the code patches had split the
> documentation, then I could apply the first 3 patches and we would only
> have to sort out the ratelimiting muck.

If I split this whole patch set into two patch sets:
1. Three patches in the first patch set: the enumeration patch, the warn
   and fatal patch, and the documentation patch.
2. Two patches in the second patch set: the ratelimit patch and the
   documentation patch.

Then I will send the two patch sets separately, you will accept them one
by one. Is that OK?

Or should I still send the 5 patches in one patch set so you will pick
up the first 3 patches and then the next 2 patches separately?

Thanks.

-Fenghua
RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
> What is the justification for making this rate limit per UID and not
> per task, per process or systemwide?

The concern is that a malicious user is running a workload that loops
obtaining the buslock. This brings the whole system to its knees.

Limiting per task doesn't help. The user can just fork(2) a whole bunch
of tasks for a distributed buslock attack.

Systemwide might be an interesting alternative. Downside would be
accidental rate limit of non-malicious tasks that happen to grab a bus
lock periodically but in the same window with other buslocks from other
users.

Do you think that's a risk worth taking to make the code simpler?

-Tony
Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote:
> Change Log:
> v5:
> Address all comments from Thomas:
> - Merge patch 2 and patch 3 into one patch so all "split_lock_detect="
>   options are processed in one patch.

What? I certainly did not request that. I said:

  "Why is this separate and an all in one thing? patch 2/4 changes the
   parameter first and 3/4 adds a new option"

which means we want documentation for patch 2 and documentation for
patch 3?

The ratelimit thing is clearly an extra functionality on top of that
buslock muck.

Next time I write it out..

> +	if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) {
> +		bld_ratelimit = ratelimit;

So any rate up to INTMAX/s is valid here, right?

> +	case sld_ratelimit:
> +		/* Enforce no more than bld_ratelimit bus locks/sec. */
> +		while (!__ratelimit(&get_current_user()->bld_ratelimit))
> +			msleep(1000 / bld_ratelimit);

which is cute because msleep() will always sleep until the next jiffie
increment happens.

What's not so cute here is the fact that get_current_user() takes a
reference on current's UID on every invocation, but nothing ever calls
free_uid(). I missed that last time over the way more obvious HZ
division.

> +++ b/kernel/user.c
> @@ -103,6 +103,9 @@ struct user_struct root_user = {
>  	.locked_shm	= 0,
>  	.uid		= GLOBAL_ROOT_UID,
>  	.ratelimit	= RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	.bld_ratelimit	= RATELIMIT_STATE_INIT(root_user.bld_ratelimit, 0, 0),
> +#endif
>  };
>
>  /*
> @@ -172,6 +175,11 @@ void free_uid(struct user_struct *up)
>  	free_user(up, flags);
>  }
>
> +#ifdef CONFIG_CPU_SUP_INTEL
> +/* Some Intel CPUs may set this for rate-limited bus locks. */
> +int bld_ratelimit;
> +#endif

Of course this variable is still required to be in the core kernel code
because?

While you decided to munge this all together, you obviously ignored the
following review comment:

  "It also lacks the information that the ratelimiting is per UID
   and not per task and why this was chosen to be per UID..."

There is still no reasoning neither in the changelog nor in the cover
letter nor in a reply to my review.

So let me repeat my question and make it more explicit:

What is the justification for making this rate limit per UID and not
per task, per process or systemwide?

> struct user_struct *alloc_uid(kuid_t uid)
> {
> 	struct hlist_head *hashent = uidhashentry(uid);
> @@ -190,6 +198,11 @@ struct user_struct *alloc_uid(kuid_t uid)
>  	refcount_set(&new->__count, 1);
>  	ratelimit_state_init(&new->ratelimit, HZ, 100);
>  	ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE);
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	ratelimit_state_init(&new->bld_ratelimit, HZ, bld_ratelimit);
> +	ratelimit_set_flags(&new->bld_ratelimit,
> +			    RATELIMIT_MSG_ON_RELEASE);
> +#endif

If this has a proper justification for being per user and having to add
40 bytes per UID for something which is mostly unused then there are
definitely better ways to do that than slapping #ifdefs into
architecture agnostic core code.

So if you instead of munging the code patches had split the
documentation, then I could apply the first 3 patches and we would only
have to sort out the ratelimiting muck.

Thanks,

        tglx
[PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.

  split_lock_detect=
		#AC for split lock	#DB for bus lock

  off		Do nothing		Do nothing

  warn		Kernel OOPs		Warn once per task and
		Warn once per task	disable future checking
		and continues to run.
		When both features are
		supported, warn in #AC

  fatal		Kernel OOPs		Send SIGBUS to user.
		Send SIGBUS to user
		When both features are
		supported, fatal in #AC

  ratelimit:N	Do nothing		Limit bus lock rate to
					N per second in the
					current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detect when CPL>0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.
So there is no need to handle #DB for bus lock detected in the kernel.

#DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR
while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one
#DB. The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for bus lock in userspace clears DR6[11]. To avoid
confusion in identifying #DB, the #DB handler sets the bit to 1 before
returning to the interrupted task.

Signed-off-by: Fenghua Yu
Reviewed-by: Tony Luck
---
Change Log:
v5:
Address all comments from Thomas:
- Merge patch 2 and patch 3 into one patch so all "split_lock_detect="
  options are processed in one patch.
- Change warn to #AC if both #AC and #DB are supported.
- Remove sld and bld variables and use boot_cpu_has() to check bus lock
  and split lock support.
- Remove bus lock checking in handle_bus_lock().
- Remove bld_ratelimit < HZ/2 check.
- Add rate limit handling comment in bus lock #DB.
- Define bld_ratelimit only for Intel CPUs.

v3:
- Enable Bus Lock Detection when fatal to handle bus lock from non-WB
  (PeterZ).

v2:
- Send SIGBUS in fatal case for bus lock #DB (PeterZ).

v1:
- Check bus lock bit by its positive polarity (Xiaoyao).

RFC v3:
- Remove DR6_RESERVED change (PeterZ).

 arch/x86/include/asm/cpu.h           |  10 +-
 arch/x86/include/asm/msr-index.h     |   1 +
 arch/x86/include/uapi/asm/debugreg.h |   1 +
 arch/x86/kernel/cpu/common.c         |   2 +-
 arch/x86/kernel/cpu/intel.c          | 148 +++++++++++++++++++++------
 arch/x86/kernel/traps.c              |   7 ++
 include/linux/sched/user.h           |   5 +-
 kernel/user.c                        |  13 +++
 8 files changed, 162 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index da78ccbd493b..991de5f2a09c 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -41,12 +41,14 @@ unsigned int x86_family(unsigned int sig);
 unsigned int x86_model(unsigned int sig);
 unsigned int x86_stepping(unsigned int sig);
 #ifdef CONFIG_CPU_SUP_INTEL
-extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
+extern void __init sld_setup(struct cpuinfo_x86 *c);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
+extern void handle_bus_lock(struct pt_regs *regs);
+extern int bld_ratelimit;
 #else
-static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
+static inline void __init sld_setup(struct cpuinfo_x86 *c) {}
 static inline void switch_to_sld(unsigned long tifn) {}
 static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code)
 {
@@ -57,6 +59,10 @@ static inline bool handle_guest_split_lock(unsigned long ip)
 {
 	return false;
 }
+
+static inline void handle_bus_lock(struct pt_regs *regs)
+{
+}
 #endif
 #ifdef CONFIG_IA32_FEAT_CTL
 void init_ia32_feat_ctl(struct cpuinfo_x86 *c);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 546d6ecf0a35..558485965f21 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -265,6 +265,7 @@
 #define DEBUGCTLMSR_LBR			(1UL <<  0) /* last branch recording */
 #define DEBUGCTLMSR_BTF_SHIFT		1
 #define DEBUGCTLMSR_BTF			(1UL <<  1) /* single-step on branches */
+#define DEBUGCTLMSR_BUS_LOCK_DETECT	(1UL <<  2) /* Bus lock detect */
 #define DEBUGCTLMSR_TR