Re: [PATCH] page_alloc: Fix freeing non-compound pages
On Tue, Sep 22, 2020 at 03:00:17PM +0100, Matthew Wilcox (Oracle) wrote:
> Here is a very rare race which leaks memory:
>
> Page P0 is allocated to the page cache.
> Page P1 is free.
>
> Thread A                Thread B                Thread C
> find_get_entry():
> xas_load() returns P0
>                         Removes P0 from page cache
>                         Frees P0
>                         P0 merged with its buddy P1
>                                                 alloc_pages(GFP_KERNEL, 1) returns P0
>                                                 P0 has refcount 1
> page_cache_get_speculative(P0)
> P0 has refcount 2
>                                                 __free_pages(P0)
>                                                 P0 has refcount 1
> put_page(P0)
> P1 is not freed
>
> Fix this by freeing all the pages in __free_pages() that won't be freed
> by the call to put_page(). It's usually not a good idea to split a page,
> but this is a very unlikely scenario.
>
> Fixes: e286781d5f2e ("mm: speculative page references")
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  mm/page_alloc.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fab5e97dc9ca..5db74797db39 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4943,10 +4943,19 @@ static inline void free_the_page(struct page *page, unsigned int order)
>  		__free_pages_ok(page, order);
>  }
>
> +/*
> + * If we free a non-compound allocation, another thread may have a
> + * speculative reference to the first page. It has no way of knowing
> + * about the rest of the allocation, so we have to free all but the
> + * first page here.
> + */
>  void __free_pages(struct page *page, unsigned int order)
>  {
>  	if (put_page_testzero(page))
>  		free_the_page(page, order);
> +	else if (!PageHead(page))
> +		while (order-- > 0)
> +			free_the_page(page + (1 << order), order);
>  }
>  EXPORT_SYMBOL(__free_pages);

So the obvious question I have here is why not teach put_page() to free
the whole thing?
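The splitting loop in the patch is compact; a quick model (Python, purely illustrative) shows that freeing `page + (1 << order)` at each decremented order covers every page of the allocation except page 0, which the remaining reference holder's eventual put_page() frees:

```python
def pages_freed_by_split(order):
    """Model of the patch's fallback loop:
         while (order-- > 0)
             free_the_page(page + (1 << order), order);
       Returns the set of page offsets freed by the split."""
    freed = set()
    while order > 0:
        order -= 1
        base = 1 << order                       # offset of the block freed this pass
        freed |= set(range(base, base + (1 << order)))
    return freed

# For an order-3 (8-page) allocation, pages 1..7 are freed by the split;
# page 0 stays behind for the speculative reference's put_page().
print(sorted(pages_freed_by_split(3)))   # → [1, 2, 3, 4, 5, 6, 7]
```

So the split frees exactly the sub-buddies the speculative reference holder cannot know about, never touching page 0 and never freeing any page twice.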
Re: [PATCH] latency improvement in __smp_call_single_queue
On Wed, Sep 23, 2020 at 10:00:41AM -0500, George Prekas wrote:
> If an interrupt arrives between llist_add and
> send_call_function_single_ipi in the following code snippet, then the
> remote CPU will not receive the IPI in a timely manner and subsequent
> SMP calls even from other CPUs for other functions will be delayed:
>
>	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
>		send_call_function_single_ipi(cpu);
>
> Note: llist_add returns 1 if it was empty before the operation.
>
>   CPU 0                         | CPU 1                      | CPU 2
>   __smp_call_single_q(2,f1)     | __smp_call_single_q(2,f2)  |
>   llist_add returns 1           |                            |
>   interrupted                   | llist_add returns 0        |
>   ...                           | branch not taken           |
>   ...                           |                            |
>   resumed                       |                            |
>   send_call_function_single_ipi |                            |
>                                 |                            | f1
>                                 |                            | f2
>
> The call from CPU 1 for function f2 will be delayed because CPU 0 was
> interrupted.

Do you happen to have any actual numbers and a use-case where this was
relevant?
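For readers unfamiliar with the pattern being discussed: only the enqueue that transitions the list from empty to non-empty sends an IPI, and everyone else piggy-backs on it. A toy model (not kernel code) of that coalescing:

```python
class CallSingleQueue:
    """Toy model of the llist_add()/IPI pattern in __smp_call_single_queue():
    only the enqueuer that finds the list empty sends an IPI; later
    enqueuers rely on that first IPI to get their function run too."""
    def __init__(self):
        self.items = []
        self.ipis_sent = 0

    def smp_call(self, func):
        was_empty = not self.items
        self.items.append(func)        # llist_add()
        if was_empty:                  # llist_add() returned 1
            self.ipis_sent += 1        # send_call_function_single_ipi()

q = CallSingleQueue()
for f in ("f1", "f2", "f3"):
    q.smp_call(f)
print(q.ipis_sent)   # → 1 : f2 and f3 piggy-back on f1's IPI
```

This is exactly why the report matters: if the one task responsible for the single IPI is interrupted before sending it, every piggy-backed caller is delayed along with it.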
Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs
On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote:
> Introduce a new API hk_num_online_cpus(), that can be used to
> retrieve the number of online housekeeping CPUs that are meant to handle
> managed IRQ jobs.
>
> This API is introduced for the drivers that were previously relying only
> on num_online_cpus() to determine the number of MSIX vectors to create.
> In an RT environment with many isolated but few housekeeping CPUs this
> was leading to a situation where an attempt to move all of the vectors
> corresponding to isolated CPUs to housekeeping CPUs was failing due to
> the per-CPU vector limit.
>
> Signed-off-by: Nitesh Narayan Lal
> ---
>  include/linux/sched/isolation.h | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index cc9f393e2a70..2e96b626e02e 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
>  	return true;
>  }
>
> +static inline unsigned int hk_num_online_cpus(void)

This breaks with the established naming of that header.
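Naming aside, the helper boils down to counting the intersection of the online mask and the managed-IRQ housekeeping mask (my paraphrase of the patch description). A sketch of how a driver might use it to cap MSI-X vectors; the CPU counts here are made up:

```python
def hk_num_online_cpus(online_cpus, hk_managed_irq_cpus):
    """Model of the proposed helper: online CPUs that are also
    housekeeping CPUs for managed-IRQ work."""
    return len(online_cpus & hk_managed_irq_cpus)

# Hypothetical RT box: 64 CPUs online, only CPUs 0-3 are housekeeping.
online = set(range(64))
housekeeping = {0, 1, 2, 3}

wanted_msix = len(online)   # what a driver using num_online_cpus() would ask for
capped = min(wanted_msix, hk_num_online_cpus(online, housekeeping))
print(capped)   # → 4 : vectors stay within what housekeeping CPUs can absorb
```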
Re: [PATCH 0/2] measure latency of cpu hotplug path
On Wed, Sep 23, 2020 at 04:37:44PM -0700, Prasad Sodagudi wrote:
> These are all changes related to the cpu hotplug path and we would like
> to seek upstream review. These patches have been in the Qualcomm
> downstream kernel for quite a long time. The first patch sets rt
> priority for the hotplug task and the second patch adds cpuhp trace
> events.
>
> 1) cpu-hotplug: Always use real time scheduling when hotplugging a CPU
> 2) cpu/hotplug: Add cpuhp_latency trace event

Why? Hotplug is a known super slow path. If you care about hotplug
latency you're doing it wrong.
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 10:40:32 +0200
> pet...@infradead.org wrote:
>
> > However, with migrate_disable() we can have each task preempted in a
> > migrate_disable() region, worse we can stack them all on the _same_ CPU
> > (super ridiculous odds, sure). And then we end up only able to run one
> > task, with the rest of the CPUs picking their nose.
>
> What if we just made migrate_disable() a local_lock() available for !RT?

Can't, neither migrate_disable() nor migrate_enable() is allowed to
block -- which is what makes their implementation so 'interesting'.

> This should lower the SHC in theory, if you can't have stacked migrate
> disables on the same CPU.

See this email in that other thread (you're on Cc too IIRC):

  https://lkml.kernel.org/r/20200923170809.gy1362...@hirez.programming.kicks-ass.net

I think that if we 'frob' the balance PULL, we'll end up with something
similar.

Whichever way around we turn this thing, the migrate_disable() runtime
(we'll have to add a tracer for that) will be an interference term on
the lower priority task, exactly like preempt_disable() would be. We'll
just not exclude a higher priority task from running.

AFAICT; the best case is a single migrate_disable() nesting, where a
higher priority task preempts in a migrate_disable() section -- this is
per design.

When this preempted task becomes eligible to run under the ideal model
(IOW it becomes one of the M highest priority tasks), it might still
have to wait for the preemptee's migrate_disable() section to complete.
Thereby suffering interference in the exact duration of the
migrate_disable() section.

Per this argument, the change from preempt_disable() to
migrate_disable() gets us:

 - higher priority tasks gain reduced wake-up latency

 - lower priority tasks are unchanged and are subject to the exact same
   interference term as if the higher priority task were using
   preempt_disable().
Since we've already established this term is unbounded, any task but the
highest priority task is basically buggered.

TL;DR, if we get balancing fixed and achieve (near) the optimal case
above, migrate_disable() is an over-all win. But it's provably
non-deterministic as long as the migrate_disable() sections are
non-deterministic.

The reason this all mostly works in practice is (I think) because:

 - People care most about the higher prio RT tasks and craft them to
   mostly avoid the migrate_disable() infected code.

 - The preemption scenario is most pronounced at higher utilization
   scenarios, and I suspect this is fairly rare to begin with.

 - And while many of these migrate_disable() regions are unbound in
   theory, in practice they're often fairly reasonable.

So my current todo list is:

 - Change RT PULL
 - Change DL PULL
 - Add migrate_disable() tracer; exactly like preempt/irqoff, except
   measuring task-runtime instead of cpu-time.
 - Add a mode that measures actual interference.
 - Add a traceevent to detect preemption in migrate_disable().

And then I suppose I should twist Daniel's arm to update his model to
include these scenarios and numbers.
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 06:44:12AM +0200, Dmitry Vyukov wrote:
> On Thu, Sep 24, 2020 at 6:36 AM Herbert Xu wrote:
> > > (k-slock-AF_INET6){+.-.}-{2:2}

That's a seqlock.

> > What's going on with all these bogus lockdep reports?
> >
> > These are two completely different locks, one is for TCP and the
> > other is for SCTP. Why is lockdep suddenly becoming confused about
> > this?
>
> FWIW this flood of bogus reports started on 16/Sep.
>
> FWIW one of the dups of this issue was bisected to:
>
> commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
> Author: Ahmed S. Darwish
> Date: Fri Sep 4 15:32:31 2020 +
>
>     seqlock: PREEMPT_RT: Do not starve seqlock_t writers
>
> Can it be related?

Did that tree you're testing include 267580db047e ("seqlock: Unbreak
lockdep") ?
Re: [PATCH 7/9] sched: Add migrate_disable()
On Wed, Sep 23, 2020 at 10:31:10AM +0200, Thomas Gleixner wrote:
> In practice migrate disable could be taken into account on placement
> decisions, but yes we don't have anything like that at the moment.

I think at the very least we should do some of that.

The premise is wanting to run the M highest priority tasks; when a CPU
drops priority, it tries to PULL a higher priority task towards itself.

If this PULL selects a migrate_disable() task, it means the task is in
the M highest prio tasks. Since obviously that task cannot get pulled,
we should then pull the current running task of that CPU, which would
then allow the migrate_disable() task to resume execution.

I'll go try and cook up something along those lines...
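The suggested balancing tweak could look roughly like this; pure pseudo-logic in Python, with the task fields and function names invented for illustration (this is not the actual rt.c pull path):

```python
def pick_pull_action(this_cpu, tasks):
    """Sketch of the proposed PULL tweak: if the best pullable task is
    migrate-disabled (so it is one of the M highest-prio tasks but is
    pinned), displace the task currently running on its CPU instead,
    so the pinned task can resume there."""
    candidates = [t for t in tasks
                  if t["cpu"] != this_cpu and not t["running"]]
    if not candidates:
        return None
    best = max(candidates, key=lambda t: t["prio"])
    if not best["migrate_disabled"]:
        return ("pull", best["name"])          # the normal PULL
    # Cannot migrate 'best'; pull whatever runs on its CPU instead.
    curr = next(t for t in tasks if t["cpu"] == best["cpu"] and t["running"])
    return ("pull", curr["name"])

tasks = [
    {"name": "lo", "cpu": 1, "prio": 10, "running": True,  "migrate_disabled": False},
    {"name": "hi", "cpu": 1, "prio": 90, "running": False, "migrate_disabled": True},
]
print(pick_pull_action(0, tasks))   # → ('pull', 'lo') : frees CPU1 for 'hi'
```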
Re: [PATCH] sched/fair: Use bool parameter for update_tg_load_avg()
On Wed, Sep 23, 2020 at 09:29:35PM +0800, Xianting Tian wrote:
> In the file fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used,
> sometimes update_tg_load_avg(cfs_rq, false) is used. So change it
> to use a bool parameter.

afaict it's never true (or my git-grep failed), so why not remove the
argument entirely?
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> Alternatively this could of course be solved with per CPU page tables
> which will come around some day anyway I fear.

Previously (with PTI) we looked at making the entire kernel map per-CPU,
and that takes a 2K copy on switch_mm() (or more generally, the user
part of whatever the top level directory is for architectures that have
a shared kernel/user page-table setup in the first place).

The idea was having a fixed per-cpu kernel page-table, share a bunch of
(kernel) page-tables between all CPUs and then copy in the user part on
switch. I've forgotten what the plan was for ASID/PCID in that scheme.

For x86_64 we've been fearing the performance of that 2k copy, but I
don't think we've ever actually bit the bullet and implemented it to see
how bad it really is.
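For the curious, the 2K figure is just back-of-envelope x86-64 arithmetic: the top-level page table (PGD) is one 4 KiB page of 512 8-byte entries, split evenly between the user and kernel halves, so only the user half would need copying into a per-CPU PGD on switch_mm():

```python
ENTRY_BYTES = 8      # one PGD entry on x86-64
PGD_ENTRIES = 512    # 4 KiB page / 8-byte entries

# Only the user half of the top-level table must be copied per switch_mm().
user_half = (PGD_ENTRIES // 2) * ENTRY_BYTES
print(user_half)     # → 2048 bytes, i.e. the "2K copy" above
```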
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner wrote:
> >>
> >> If a task is migrated to a different CPU then the mapping address will
> >> change which will explode in colourful ways.
> >
> > Right you are.
> >
> > Maybe we really *could* call this new kmap functionality something
> > like "kmap_percpu()" (or maybe "local" is good enough), and make it
> > act like your RT code does for spinlocks - not disable preemption, but
> > only disabling CPU migration.
>
> I'm all for it, but the scheduler people have opinions :)

Right, so I'm concerned. migrate_disable() wrecks pretty much all
Real-Time scheduler theory we have, and PREEMPT_RT bringing it in is
somewhat ironic.

Yes, it allows breaking up non-preemptible regions of non-deterministic
duration, and thereby both reduces and bounds the scheduling latency,
but the cost of doing that is that the theory on CPU
utilization/bandwidth goes out the window.

To easily see this, consider an SMP system with a number of tasks equal
to the number of CPUs. On a regular (preempt_disable) kernel we can
always run each task, by virtue of always having an idle CPU to take the
task.

However, with migrate_disable() we can have each task preempted in a
migrate_disable() region, worse we can stack them all on the _same_ CPU
(super ridiculous odds, sure). And then we end up only able to run one
task, with the rest of the CPUs picking their nose.

The end result is that, like with unbounded latency, we will still miss
our deadline, simply because we got starved for CPU.

Now, while we could (with a _lot_ of work) rework the kernel to not rely
on the implicit per-cpu ness of things like spinlock_t, the moment we
bring in basic primitives that rely on migrate_disable() we're stuck
with it.

The problem is; afaict there's been no research into this problem.
There might be scheduling (read: balancing) schemes that can
mitigate/solve this problem, or it might prove to be a 'hard' problem,
I just don't know.

But once we start down this road, it's going to be hell to get rid of
it. That's why I've been arguing (for many years) to strictly limit this
to PREEMPT_RT and only as a stop-gap, not a fundamental primitive to
build on.
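The stacking worst case described above is easy to see with a toy model (illustrative only, nothing like the real scheduler): N tasks on N CPUs, each preempted inside a migrate_disable() region on the same CPU, can only make progress one at a time, while N-1 CPUs sit idle.

```python
def usable_parallelism(pinned_cpus, n_cpus):
    """pinned_cpus[i] is the CPU task i is stuck on inside
    migrate_disable(), or None if the task is free to migrate.
    Pinned tasks serialize per CPU; free tasks can take any idle CPU."""
    pinned = {c for c in pinned_cpus if c is not None}
    free = sum(1 for c in pinned_cpus if c is None)
    return min(len(pinned) + free, n_cpus, len(pinned_cpus))

N = 4
print(usable_parallelism([None] * N, N))   # → 4 : preempt_disable() world, all CPUs busy
print(usable_parallelism([0] * N, N))      # → 1 : all stacked on CPU0, 3 CPUs idle
```

The point of the model is only the asymmetry: preempt_disable() costs latency but never concurrency, while stacked migrate_disable() regions can cost concurrency itself.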
Re: [PATCH 7/9] sched: Add migrate_disable()
On Mon, Sep 21, 2020 at 09:16:54PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 18:36, Peter Zijlstra wrote:
> > +/*
> > + * Migrate-Disable and why it is (strongly) undesired.
> > + *
> > + * The premise of the Real-Time schedulers we have on Linux
> > + * (SCHED_FIFO/SCHED_DEADLINE) is that M CPUs can/will run M tasks
> > + * concurrently, provided there are sufficient runnable tasks, also known as
> > + * work-conserving. For instance SCHED_DEADLINE tries to schedule the M
> > + * earliest deadline threads, and SCHED_FIFO the M highest priority threads.
> > + *
> > + * The correctness of various scheduling models depends on this, but it is
> > + * broken by migrate_disable() that doesn't imply preempt_disable(). Where
> > + * preempt_disable() implies an immediate priority ceiling, preemptible
> > + * migrate_disable() allows nesting.
> > + *
> > + * The worst case is that all tasks preempt one another in a migrate_disable()
> > + * region and stack on a single CPU. This then reduces the available bandwidth
> > + * to a single CPU. And since Real-Time schedulability theory considers the
> > + * Worst-Case only, all Real-Time analysis shall revert to single-CPU
> > + * (instantly solving the SMP analysis problem).
>
> I'm telling you for years that SMP is the source of all evils and
> NR_CPUS=0 is the ultimate solution of all problems. Paul surely
> disagrees as he thinks that NR_CPUS<0 is the right thing to do.

Surely imaginary numbers are even better :-)

> But seriously, I completely understand your concern vs. schedulability
> theories, but those theories can neither deal well with preemption
> disable simply because you can create other trainwrecks when enough low
> priority tasks run long enough in preempt disabled regions in
> parallel.

Ah, no, those theories can deal with preemption disable perfectly fine.
The result is an increase in latency.
It so happens we don't like that, but that's our problem :-)

> The scheduler simply does not know ahead how long these
> sections will take and how many of them will run in parallel.

Ah, but the thing is, preempt_disable() does not limit concurrency.
Assuming idle CPUs, the waking task can always go elsewhere.

The thing with migrate_disable() OTOH is that even though there are idle
CPUs, we're actively prohibited from using them.

> The theories make some assumptions about preempt disable and consider it
> as temporary priority ceiling, but that's all assumptions as the bounds
> of these operations are simply unknown.

Sure, that directly translates into unbounded (or rather of
non-deterministic duration) latencies, which are bad for determinism.
But the theory is fairly clear on this.

> > + * The reason we have it anyway.
> > + *
> > + * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
> > + * number of primitives into becoming preemptible, they would also allow
> > + * migration. This turns out to break a bunch of per-cpu usage. To this end,
> > + * all these primitives employ migrate_disable() to restore this implicit
> > + * assumption.
> > + *
> > + * This is a 'temporary' work-around at best. The correct solution is getting
> > + * rid of the above assumptions and reworking the code to employ explicit
> > + * per-cpu locking or short preempt-disable regions.
>
> What timeframe are you envisioning for 'temporary'? I assume something
> which is closer to your retirement than to mine :)

I figured we'd put a WARN on per-cpu usage with only migrate_disable(),
under a Kconfig knob, much like how RCU-lockdep started, once all of
PREEMPT_RT has landed.

Gotta keep busy, right :-)

> > + * The end goal must be to get rid of migrate_disable(), alternatively we need
> > + * a schedulability theory that does not depend on arbitrary migration.
> > Finally something new the academics can twist their brain around :)

I'm sure they've been waiting for more work ;-)

> But as the kmap discussion has shown, the current situation of enforcing
> preempt disable even on a !RT kernel is not pretty either. I looked at
> quite some of the kmap_atomic() usage sites and the resulting
> workarounds for non-preemptability are pretty horrible especially if
> they do copy_from/to_user() or such in those regions. There is tons of
> other code which really only requires migrate disable.

Yes, I'm on that thread, I'll reply there as well, I really hate going
down that path without having a decent understanding of the
ramifications. The more we spread this muck around, the deeper the hole
we dig for ourselves to climb out of.

The thing is, afaik the only theory that 'works' with migrate_disable()
is fully partitioned, but we break that by having cross CPU blocking
chains.
Re: [RESEND PATCH V2 0/6] Support PCIe3 uncore PMU on Snow Ridge
On Mon, Sep 14, 2020 at 07:34:14AM -0700, kan.li...@linux.intel.com wrote: > From: Kan Liang > > Changes since V1: > - Drop the platform device solution > - A new uncore PCI sub driver solution is introduced which searches > the PCIe Root Port device via pci_get_device() and id table. > Register a PCI bus notifier for the remove notification. Once the > device is removed, the uncore driver can be notified to unregister > the corresponding PMU. > - Modify the parameters of uncore_pci_pmu_unregister() function. Bjorn, you hated on the last version of this thing, are you OK with this one?
Re: [PATCH 0/3] drm: commit_work scheduling
On Mon, Sep 21, 2020 at 11:21:54AM +0200, Daniel Vetter wrote: > So question to rt/worker folks: What's the best way to let userspace set > the scheduling mode and priorities of things the kernel does on its > behalf? Surely we're not the first ones where if userspace runs with some > rt priority it'll starve out the kernel workers that it needs. Hardcoding > something behind a subsystem ioctl (which just means every time userspace > changes what it does, we need a new such flag or mode) can't be the right > thing. > > Peter, Tejun? So regular workqueues do not support RT priorities, but you can set their nice value somewhere in /sys. The kthread_work stuff used in these patches result in a regular kthread and as such the user interface for changing its scheduling class or priority is that of any other 'random' task.
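Concretely, for the kthread_work case: since the worker is an ordinary kernel thread, userspace adjusts it with the same syscalls chrt(1) uses on any task. A sketch, run against the calling thread here because operating on a real kthread needs its PID plus CAP_SYS_NICE; SCHED_BATCH is used only because switching to it is unprivileged:

```python
import os

# Inspect and change a task's scheduling policy, as chrt(1) would do
# for a kthread_worker's PID. SCHED_FIFO/SCHED_RR would need privileges;
# SCHED_BATCH does not, so it works as a self-contained demo.
pid = 0  # 0 == calling thread; a real user would pass the kthread's PID
print(os.sched_getscheduler(pid))                     # current policy
os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
print(os.sched_getscheduler(pid) == os.SCHED_BATCH)
```

The same pair of calls with `os.SCHED_FIFO` and a nonzero `sched_param` is what a privileged userspace would issue to give an RT priority to the commit-work kthread discussed above.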
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:48:24PM +0200, Oleg Nesterov wrote:
> Of course, this assumes that atomic_t->counter underflows "correctly", just
> like "unsigned int".

We've documented that we do. Lots of code relies on that.

See Documentation/atomic_t.txt TYPES

> But again, do we really want this?

I like the two counters better, it avoids atomics entirely, and some
archs have horridly expensive atomics (*cough* power *cough*).

I just tried to be clever and use a single u64 load (where possible)
instead of two 32bit loads and got the sum vs split order wrong.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote: > + u64 sum = per_cpu_sum(*(u64 *)sem->read_count); Moo, that doesn't work, we have to do two separate sums. I shouldn't try to be clever on a Friday I suppose :-(
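One way to see why the single u64 sum can't work (my reading of the problem, not spelled out in the thread): summing the per-CPU u64 words lets a carry out of the low 32-bit halves leak into the high halves, which the final fold then miscounts as a reader. Two independent 32-bit sums don't have that problem. A quick demonstration:

```python
MASK32 = (1 << 32) - 1

def folded_u64_sum(per_cpu_words):
    """The (u32)(sum + (sum >> 32)) variant from the thread: sum the
    per-CPU u64 words, then fold high half into low half."""
    s = sum(per_cpu_words) & ((1 << 64) - 1)
    return (s + (s >> 32)) & MASK32

def two_separate_sums(per_cpu_words):
    """Sum each 32-bit half independently (mod 2^32), then combine."""
    lo = sum(w & MASK32 for w in per_cpu_words) & MASK32
    hi = sum(w >> 32 for w in per_cpu_words) & MASK32
    return (lo + hi) & MASK32

# A reader does down_read() on CPU0 and (after migrating) up_read() on
# CPU1: the low halves hold +1 and -1 (mod 2^32). No readers are active.
cpu0 = 1
cpu1 = MASK32            # -1 in the low half, as an unsigned word

print(two_separate_sums([cpu0, cpu1]))  # → 0 : correct, no readers
print(folded_u64_sum([cpu0, cpu1]))     # → 1 : the carry out of the low
                                        #       halves lands in the high
                                        #       bits and the fold counts
                                        #       it as an active reader
```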
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:04:32PM +0200, pet...@infradead.org wrote:
> On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> > @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
> >   */
> >  static bool readers_active_check(struct percpu_rw_semaphore *sem)
> >  {
> > -	if (per_cpu_sum(*sem->read_count) != 0)
> > +	u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> > +
> > +	if (sum + (sum >> 32))
>
> That obviously wants to be:
>
>	if ((u32)(sum + (sum >> 32)))
>
> >  		return false;
> >
> >  	/*

I suppose an alternative way of writing that would be something like:

	union {
		u64 sum;
		struct {
			u32 a, b;
		};
	} var;

	var.sum = per_cpu_sum(*(u64 *)sem->read_count);
	if (var.a + var.b)
		return false;

which is more verbose, but perhaps easier to read.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
>   */
>  static bool readers_active_check(struct percpu_rw_semaphore *sem)
>  {
> -	if (per_cpu_sum(*sem->read_count) != 0)
> +	u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> +
> +	if (sum + (sum >> 32))

That obviously wants to be:

	if ((u32)(sum + (sum >> 32)))

>  		return false;
>
>  	/*
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 11:07:02AM +0200, Jan Kara wrote:
> If people really wanted to avoid irq-safe inc/dec for archs where it is
> more expensive, one idea I had was that we could add 'read_count_in_irq' to
> percpu_rw_semaphore. So callers in normal context would use read_count and
> callers in irq context would use read_count_in_irq. And the writer side
> would sum over both but we don't care about performance of that one much.

That's not a bad idea... something like so I suppose. (completely untested)

---
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5fda40f97fe9..9c847490a86a 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -11,7 +11,7 @@
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
-	unsigned int __percpu	*read_count;
+	u32 __percpu		*read_count;
 	struct rcuwait		writer;
 	wait_queue_head_t	waiters;
 	atomic_t		block;
@@ -60,7 +60,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -74,12 +74,16 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -98,12 +102,16 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, _RET_IP_);

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,12 +123,39 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
 }

+static inline void __percpu_up_read_irqsafe(struct percpu_rw_semaphore *sem)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	if (likely(rcu_sync_is_idle(&sem->rss))) {
+		__this_cpu_dec(sem->read_count[1]);
+	} else {
+		/*
+		 * slowpath; reader will only ever wake a single blocked
+		 * writer.
+		 */
+		smp_mb(); /* B matches C */
+		/*
+		 * In other words, if they see our decrement (presumably to
+		 * aggregate zero, as that is the only time it matters) they
+		 * will also see our critical section.
+		 */
+		__this_cpu_dec(sem->read_count[1]);
+		rcuwait_wake_up(&sem->writer);
+	}
+	local_irq_restore(flags);
+}
+
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a576f3f..00741216a7f6 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -12,7 +12,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
 			const char *name, struct lock_class_key *key)
 {
-	sem->read_count = alloc_percpu(int);
+	sem->read_count = (u32 *)alloc_percpu(u64);
 	if (unlikely(!sem->read_count))
 		return -ENOMEM;

@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);

 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	this_cpu_inc(*sem->read_count);
+	__this_cpu_inc(sem->read_count[0]);

 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(!atomic_read_acquire(&sem->block)))
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Fri, Sep 18, 2020 at 09:00:03AM +0200, Thomas Gleixner wrote:
> >> +void migrate_disable(void)
> >> +{
> >> +	unsigned long flags;
> >> +
> >> +	if (!current->migration_ctrl.disable_cnt) {
> >> +		raw_spin_lock_irqsave(&current->pi_lock, flags);
> >> +		current->migration_ctrl.disable_cnt++;
> >> +		raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> >> +	} else {
> >> +		current->migration_ctrl.disable_cnt++;
> >> +	}
> >> +}
> >
> > That pi_lock seems unfortunate, and it isn't obvious what the point of
> > it is.
>
> Indeed. That obviously lacks a big fat comment.
>
> current->migration_ctrl.disable_cnt++ is obviously a RMW operation. So
> you end up with the following:
>
>     CPU0                                        CPU1
>     migrate_disable()
>       R = current->migration_ctrl.disable_cnt;
>                                                 set_cpus_allowed_ptr()
>                                                   task_rq_lock();
>                                                   if (!p->migration_ctrl.disable_cnt) {
>       current->migration_ctrl.disable_cnt = R + 1;
>                                                   stop_one_cpu();
>                                                 ---> stopper_thread()
>                                                        BUG_ON(task->migration_ctrl.disable_cnt);
>
> I tried to back out from that instead of BUG(), but that ended up being
> even more of a hacky trainwreck than just biting the bullet and taking
> pi_lock.

You don't need the load-store for that I think, pure timing will do.

Blergh, lemme prepare more wake-up juice and think about that.
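The interleaving Thomas describes can be replayed step by step in a toy model (illustrative only; a plain variable stands in for current->migration_ctrl.disable_cnt and the stopper's BUG_ON):

```python
def replay_race():
    """Replays the disable_cnt RMW race: CPU1 reads the counter, CPU0's
    set_cpus_allowed_ptr() observes it still zero and queues the stopper,
    then CPU1's delayed write lands and makes the stopper's BUG_ON
    condition true."""
    disable_cnt = 0

    # CPU1: migrate_disable() reads the counter (R = disable_cnt)
    R = disable_cnt
    # CPU0: set_cpus_allowed_ptr() -> task_rq_lock(); sees no
    # migrate-disable in effect, so it queues the migration stopper
    stopper_queued = (disable_cnt == 0)
    # CPU1: the delayed write half of the RMW lands
    disable_cnt = R + 1
    # CPU0's stopper thread now runs: BUG_ON(disable_cnt)
    return stopper_queued and disable_cnt != 0

print(replay_race())   # → True : the unlocked RMW lets the BUG_ON fire
```

Taking pi_lock around the 0 -> 1 transition serializes the RMW against the task_rq_lock()/pi_lock section in set_cpus_allowed_ptr(), which is exactly the window the replay exploits.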
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 06:30:01PM +0200, Sebastian Siewior wrote:
> On 2020-09-17 17:54:10 [+0200], pet...@infradead.org wrote:
> > I'm not sure what the problem with FPU was, I was throwing alternatives
> > at tglx to see what would stick, in part to (re)discover the design
> > constraints of this thing.
>
> was this recent or distant in the time line?

The past few weeks :-) Thomas and me have been bickering about this
stuff on IRC on and off.

> > One reason for not allowing migrate_disable() to sleep was: FPU code.
> >
> > Could it be it does something like:
> >
> >	preempt_disable();
> >	spin_lock();
> >
> >	spin_unlock();
> >	preempt_enable();
> >
> > Where we'll never get preempted while migrate_disable()'d and thus never
> > trigger any of the sleep paths?
>
> I try to get rid of something like that. This doesn't work either way
> because the spin_lock() may block which it can't with disabled
> preemption.

Yeah, that obviously should have been migrate_disable/enable instead of
spin_lock/unlock :/

> Ah. We used to have migrate_disable() in pagefault_disable(). The x86
> FPU code does
>
>	preempt_disable();
>	…
>	pagefault_disable();
>
> but that migrate_disable() was moved from pagefault_disable() to
> kmap_atomic(). We shouldn't have
>
>	preempt_disable(); || local_irq_disable();
>	kmap_atomic();
>
> on RT. I've been running around removing those. See
>
>    a10dcebacdb0c ("fs/ntfs/aops.c: don't disable interrupts during kmap_atomic()")
>    ce1e518190ea7 ("ide: don't disable interrupts during kmap_atomic()")
>    f3a1075e5fc34 ("block: don't disable interrupts during kmap_atomic()")

Hmm, okay.
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 05:13:41PM +0200, Sebastian Siewior wrote:
> On 2020-09-17 16:49:37 [+0200], pet...@infradead.org wrote:
> > I'm aware of the duct-tape :-) But I was under the impression that we
> > didn't want the duct-tape, and that there was lots of issues with the
> > FPU code, or was that another issue?
>
> Of course it would be better not to need the duct tape.
> Also symmetrical locking is what you want but clearly futex is one of
> a kind.
>
> I'm currently not aware of any issues in the FPU code in regard to this.
> A few weeks ago, I was looking for this kind of usage and only futex
> popped up.

I'm not sure what the problem with FPU was, I was throwing alternatives
at tglx to see what would stick, in part to (re)discover the design
constraints of this thing.

One reason for not allowing migrate_disable() to sleep was: FPU code.

Could it be it does something like:

	preempt_disable();
	spin_lock();

	spin_unlock();
	preempt_enable();

Where we'll never get preempted while migrate_disable()'d and thus never
trigger any of the sleep paths?
Re: [PATCH net-next 6/7] lockdep: provide dummy forward declaration of *_is_held() helpers
On Wed, Sep 16, 2020 at 11:45:27AM -0700, Jakub Kicinski wrote:
> When CONFIG_LOCKDEP is not set, lock_is_held() and lockdep_is_held()
> are not declared or defined. This forces all callers to use ifdefs
> around these checks.
>
> Recent RCU changes added a lot of lockdep_is_held() calls inside
> rcu_dereference_protected(). rcu_dereference_protected() hides
> its argument on !LOCKDEP builds, but this may lead to unused variable
> warnings.
>
> Provide forward declarations of lock_is_held() and lockdep_is_held()
> but never define them. This way callers can keep them visible to
> the compiler on !LOCKDEP builds and instead depend on dead code
> elimination to remove the references before the linker barfs.
>
> We need lock_is_held() for RCU.
>
> Signed-off-by: Jakub Kicinski
> --
> CC: pet...@infradead.org
> CC: mi...@redhat.com
> CC: w...@kernel.org
> ---
>  include/linux/lockdep.h | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index 6a584b3e5c74..6b5bbc536bf6 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -371,6 +371,12 @@ static inline void lockdep_unregister_key(struct lock_class_key *key)
>
>  #define lockdep_depth(tsk)	(0)
>
> +/*
> + * Dummy forward declarations, allow users to write less ifdef-y code
> + * and depend on dead code elimination.
> + */
> +int lock_is_held(const void *);

	extern int lock_is_held(const struct lockdep_map *);

> +int lockdep_is_held(const void *);

	extern

I suppose we can't pull the lockdep_is_held() definition out from under
CONFIG_LOCKDEP because it does the ->dep_map dereference and many types
will simply not have that member.

>  #define lockdep_is_held_type(l, r)		(1)
>
>  #define lockdep_assert_held(l)	do { (void)(l); } while (0)
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 04:38:50PM +0200, Sebastian Siewior wrote: > On 2020-09-17 16:24:38 [+0200], pet...@infradead.org wrote: > > And if I'm not mistaken, the above migrate_enable() *does* require being > > able to schedule, and our favourite piece of futex: > > > > raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); > > spin_unlock(q.lock_ptr); > > > > is broken. Consider that spin_unlock() doing migrate_enable() with a > > pending sched_setaffinity(). > > There are two instances of the above and only in the futex code and we > have sort of duct tape for that by manually balancing the migrate > counter so that it does not come to this. > But yes, not having to do the manual balance is a plus. I'm aware of the duct-tape :-) But I was under the impression that we didn't want the duct-tape, and that there was lots of issues with the FPU code, or was that another issue?
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote: > +static inline void update_nr_migratory(struct task_struct *p, long delta) > +{ > + if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory) > + p->sched_class->update_migratory(p, delta); > +} Right, so as you know, I totally hate this thing :-) It adds a second (and radically different) version of changing affinity. I'm working on a version that uses the normal *set_cpus_allowed*() interface. > +/* > + * The migrate_disable/enable() fastpath updates only the tasks migrate > + * disable count which is sufficient as long as the task stays on the CPU. > + * > + * When a migrate disabled task is scheduled out it can become subject to > + * load balancing. To prevent this, update task::cpus_ptr to point to the > + * current CPUs cpumask and set task::nr_cpus_allowed to 1. > + * > + * If task::cpus_ptr does not point to task::cpus_mask then the update has > + * been done already. This check is also used in in migrate_enable() as an > + * indicator to restore task::cpus_ptr to point to task::cpus_mask > + */ > +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu) > +{ > + if (!prev->migration_ctrl.disable_cnt || > + prev->cpus_ptr != &prev->cpus_mask) > + return; > + > + prev->cpus_ptr = cpumask_of(cpu); > + update_nr_migratory(prev, -1); > + prev->nr_cpus_allowed = 1; > +} So this thing is called from schedule(), with only rq->lock held, and that violates the locking rules for changing the affinity. I have a comment that explains how it's broken and why it's sort-of working. > +void migrate_disable(void) > +{ > + unsigned long flags; > + > + if (!current->migration_ctrl.disable_cnt) { > + raw_spin_lock_irqsave(¤t->pi_lock, flags); > + current->migration_ctrl.disable_cnt++; > + raw_spin_unlock_irqrestore(¤t->pi_lock, flags); > + } else { > + current->migration_ctrl.disable_cnt++; > + } > +} That pi_lock seems unfortunate, and it isn't obvious what the point of it is. 
> +void migrate_enable(void) > +{ > + struct task_migrate_data *pending; > + struct task_struct *p = current; > + struct rq_flags rf; > + struct rq *rq; > + > + if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0)) > + return; > + > + if (p->migration_ctrl.disable_cnt > 1) { > + p->migration_ctrl.disable_cnt--; > + return; > + } > + > + raw_spin_lock_irqsave(&p->pi_lock, rf.flags); > + p->migration_ctrl.disable_cnt = 0; > + pending = p->migration_ctrl.pending; > + p->migration_ctrl.pending = NULL; > + > + /* > + * If the task was never scheduled out while in the migrate > + * disabled region and there is no migration request pending, > + * return. > + */ > + if (!pending && p->cpus_ptr == &p->cpus_mask) { > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); > + return; > + } > + > + rq = __task_rq_lock(p, &rf); > + /* Was it scheduled out while in a migrate disabled region? */ > + if (p->cpus_ptr != &p->cpus_mask) { > + /* Restore the tasks CPU mask and update the weight */ > + p->cpus_ptr = &p->cpus_mask; > + p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask); > + update_nr_migratory(p, 1); > + } > + > + /* If no migration request is pending, no further action required. */ > + if (!pending) { > + task_rq_unlock(rq, p, &rf); > + return; > + } > + > + /* Migrate self to the requested target */ > + pending->res = set_cpus_allowed_ptr_locked(p, pending->mask, > +pending->check, rq, &rf); > + complete(pending->done); > +} So, what I'm missing with all this are the design contraints for this trainwreck. Because the 'sane' solution was having migrate_disable() imply cpus_read_lock(). But that didn't fly because we can't have migrate_disable() / migrate_enable() schedule for raisins. And if I'm not mistaken, the above migrate_enable() *does* require being able to schedule, and our favourite piece of futex: raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); spin_unlock(q.lock_ptr); is broken. 
Consider that spin_unlock() doing migrate_enable() with a pending sched_setaffinity(). Let me ponder this more..
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Thu, Sep 17, 2020 at 01:48:38PM +0100, Matthew Wilcox wrote: > On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote: > > IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even > > before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this > > doesn't matter... > > > > Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid > > file_end_write() in IRQ context, but I am not sure it's worth the trouble. > > If we change bio_endio to invoke the ->bi_end_io callbacks in softirq > context instead of hardirq context, we can change the pagecache to take SoftIRQ context has exactly the same problem vs __this_cpu*().
Re: [RFC][PATCH 1/4] acpi: Use CPUIDLE_FLAG_TIMER_STOP
On Tue, Sep 15, 2020 at 06:26:52PM +0200, Rafael J. Wysocki wrote: > On Tue, Sep 15, 2020 at 12:44 PM Peter Zijlstra wrote: > > > > Make acpi_processor_idle use the common broadcast code, there's no > > reason not to. This also removes some RCU usage after > > rcu_idle_enter(). > > > > Signed-off-by: Peter Zijlstra (Intel) > > The whole series looks good to me, so please feel free to add > > Acked-by: Rafael J. Wysocki > > to all of the four patches. > > Alternatively, please let me know if you want me to take the patches. Feel free to take them. All the prerequisite borkage is in linus' tree already.
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Wed, Sep 16, 2020 at 03:58:17PM +0200, Sebastian Andrzej Siewior wrote: > On 2020-09-16 14:10:20 [+0200], pet...@infradead.org wrote: > > squeeze that in please: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a4fe22b8b8418..bed3cd28af578 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6866,7 +6866,7 @@ static int __balance_push_cpu_stop(void *arg) > raw_spin_lock_irq(&p->pi_lock); > rq_lock(rq, &rf); > > - update_rq_clock(); > + update_rq_clock(rq); > > if (task_rq(p) == rq && task_on_rq_queued(p)) { > cpu = select_fallback_rq(rq->cpu, p); > > > and count me in :) Duh... /me goes in search of the brown paper bag -- again!
Re: [PATCHv2] perf: Fix race in perf_mmap_close function
On Wed, Sep 16, 2020 at 01:53:11PM +0200, Jiri Olsa wrote: > There's a possible race in perf_mmap_close when checking ring buffer's > mmap_count refcount value. The problem is that the mmap_count check is > not atomic because we call atomic_dec and atomic_read separately. > > perf_mmap_close: > ... >atomic_dec(&rb->mmap_count); >... >if (atomic_read(&rb->mmap_count)) > goto out_put; > > >free_uid > > out_put: > ring_buffer_put(rb); /* could be last */ > > The race can happen when we have two (or more) events sharing same ring > buffer and they go through atomic_dec and then they both see 0 as refcount > value later in atomic_read. Then both will go on and execute code which > is meant to be run just once. The trivial case should be protected by mmap_sem, we call vm_ops->close() with mmap_sem held for writing IIRC. But yes, I think it's possible to construct a failure here.
Re: [PATCH 16/26] perf tools: Synthesize modules with mmap3
On Wed, Sep 16, 2020 at 12:10:21PM -0300, Arnaldo Carvalho de Melo wrote: > Em Wed, Sep 16, 2020 at 04:17:00PM +0200, pet...@infradead.org escreveu: > > On Wed, Sep 16, 2020 at 11:07:44AM -0300, Arnaldo Carvalho de Melo wrote: > > > Em Wed, Sep 16, 2020 at 10:20:18AM +0200, Jiri Olsa escreveu: > > > > > > > IIRC BUILD_ID_SIZE is 20 bytes which is the correct size for SHA-1. A > > > > > build ID may be 128-bits (16 bytes) if md5 or uuid hashes are used. > > > > > Should this test just be "> 0" ? > > > > > > > > ah right, will check on that > > > > > > And how do you deal with this in the kernel? I.e. to inform userspace, > > > via the PERF_RECORD_MMAP3 (or MMAP2 with that misc bit trick) the size > > > of the build-id? > > > > The union size is 24 bytes, so there's plenty space to store a length > > field with the buildid. > > So, I think we should instead use a bit in the misc field, stating the > kind of build-id, so that we don't waste a byte for that, I think. There's no wastage: u32 min, maj; u64 ino; u64 ino_generation; is 24 bytes, buildid is 20 bytes, that leaves us 4 bytes to encode the buildid type without growing anything.
Re: [PATCH v2] lockdep: Allow tuning tracing capacity constants.
On Wed, Sep 16, 2020 at 01:28:19PM +0200, Dmitry Vyukov wrote: > On Fri, Sep 4, 2020 at 6:05 PM Tetsuo Handa > wrote: > > > > Hello. Can we apply this patch? > > > > This patch addresses top crashers for syzbot, and applying this patch > > will help utilizing syzbot's resource for finding other bugs. > > Acked-by: Dmitry Vyukov > > Peter, do you still have concerns with this? Yeah, I still hate it with a passion; it discourages thinking. A bad annotation that blows up the lockdep storage, no worries, we'll just increase this :/ IIRC the issue with syzbot is that the current sysfs annotation is pretty terrible and generates a gazillion classes, and syzbot likes poking at /sys a lot and thus floods the system. I don't know enough about sysfs to suggest an alternative, and haven't exactly had spare time to look into it either :/ An example of a bad annotation is giving every CPU a separate class, which leads to nr_cpus! chains if CPUs arbitrarily nest (nr_cpus^2 if there's only a single nesting level).
Re: [PATCH 16/26] perf tools: Synthesize modules with mmap3
On Wed, Sep 16, 2020 at 11:07:44AM -0300, Arnaldo Carvalho de Melo wrote: > Em Wed, Sep 16, 2020 at 10:20:18AM +0200, Jiri Olsa escreveu: > > > IIRC BUILD_ID_SIZE is 20 bytes which is the correct size for SHA-1. A > > > build ID may be 128-bits (16 bytes) if md5 or uuid hashes are used. > > > Should this test just be "> 0" ? > > > > ah right, will check on that > > And how do you deal with this in the kernel? I.e. to inform userspace, > via the PERF_RECORD_MMAP3 (or MMAP2 with that misc bit trick) the size > of the build-id? The union size is 24 bytes, so there's plenty space to store a length field with the buildid.
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Wed, Sep 16, 2020 at 12:18:45PM +0200, Sebastian Andrzej Siewior wrote: > With this on top of -rc5 I get: > > [ 42.678670] process 1816 (hackbench) no longer affine to cpu2 > [ 42.678684] process 1817 (hackbench) no longer affine to cpu2 > [ 42.710502] [ cut here ] > [ 42.711505] rq->clock_update_flags < RQCF_ACT_SKIP > [ 42.711514] WARNING: CPU: 2 PID: 19 at kernel/sched/sched.h:1141 > update_curr+0xe3/0x3f0 > [ 42.736635] select_fallback_rq+0x129/0x160 > [ 42.737450] __balance_push_stop+0x132/0x230 Duh.. I wonder why I didn't see that.. anyway, I think the below version ought to cure that. > That comes due to DEQUEUE_NOCLOCK in do_set_cpus_allowed(). And then > there is this: > > [ 42.752454] == > [ 42.752455] WARNING: possible circular locking dependency detected > [ 42.752455] 5.9.0-rc5+ #107 Not tainted > [ 42.752455] -- > [ 42.752456] migration/2/19 is trying to acquire lock: > [ 42.752456] 824639f8 ((console_sem).lock){-.-.}-{2:2}, at: > down_trylock+0xa/0x30 > [ 42.752458] but task is already holding lock: > [ 42.752458] 888276ca9598 (&rq->lock){-.-.}-{2:2}, at: > __balance_push_stop+0x50/0x230 > [ 42.752460] which lock already depends on the new lock. Yeah, that's the known issue with printk()... --- Subject: sched/hotplug: Ensure only per-cpu kthreads run during hotplug From: Peter Zijlstra Date: Fri Sep 11 09:54:27 CEST 2020 In preparation for migrate_disable(), make sure only per-cpu kthreads are allowed to run on !active CPUs. This is ran (as one of the very first steps) from the cpu-hotplug task which is a per-cpu kthread and completion of the hotplug operation only requires such tasks. This constraint enables the migrate_disable() implementation to wait for completion of all migrate_disable regions on this CPU at hotplug time without fear of any new ones starting. This replaces the unlikely(rq->balance_callbacks) test at the tail of context_switch with an unlikely(rq->balance_work), the fast path is not affected. 
Signed-off-by: Peter Zijlstra (Intel) --- kernel/sched/core.c | 114 ++- kernel/sched/sched.h |5 ++ 2 files changed, 117 insertions(+), 2 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3513,8 +3513,10 @@ static inline struct callback_head *spli struct callback_head *head = rq->balance_callback; lockdep_assert_held(&rq->lock); - if (head) + if (head) { rq->balance_callback = NULL; + rq->balance_flags &= ~BALANCE_WORK; + } return head; } @@ -3535,6 +3537,22 @@ static inline void balance_callbacks(str } } +static bool balance_push(struct rq *rq); + +static inline void balance_switch(struct rq *rq) +{ + if (unlikely(rq->balance_flags)) { + /* +* Run the balance_callbacks, except on hotplug +* when we need to push the current task away. +*/ + if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || + !(rq->balance_flags & BALANCE_PUSH) || + !balance_push(rq)) + __balance_callbacks(rq); + } +} + #else static inline void __balance_callbacks(struct rq *rq) @@ -3550,6 +3568,10 @@ static inline void balance_callbacks(str { } +static inline void balance_switch(struct rq *rq) +{ +} + #endif static inline void @@ -3577,7 +3599,7 @@ static inline void finish_lock_switch(st * prev into current: */ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); - __balance_callbacks(rq); + balance_switch(rq); raw_spin_unlock_irq(&rq->lock); } @@ -6836,6 +6858,89 @@ static void migrate_tasks(struct rq *dea rq->stop = stop; } + +static int __balance_push_cpu_stop(void *arg) +{ + struct task_struct *p = arg; + struct rq *rq = this_rq(); + struct rq_flags rf; + int cpu; + + raw_spin_lock_irq(&p->pi_lock); + rq_lock(rq, &rf); + + update_rq_clock(); + + if (task_rq(p) == rq && task_on_rq_queued(p)) { + cpu = select_fallback_rq(rq->cpu, p); + rq = __migrate_task(rq, &rf, p, cpu); + } + + rq_unlock(rq, &rf); + raw_spin_unlock_irq(&p->pi_lock); + + put_task_struct(p); + + return 0; +} + +static DEFINE_PER_CPU(struct cpu_stop_work, push_work); + +/* + * Ensure we only run per-cpu kthreads once the CPU 
goes !active. + */ +static bool balance_push(struct rq *rq) +{ + struct task_struct *push_task = rq->curr; + + lockdep_assert_held(&rq->lock); + SCHED_WARN_ON(rq->cpu != smp_processor_id()); + + /* +* Both the cpu-hotplug and stop task are in this case and are +* required to complete the hotplug process. +*/ + if (is
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote: > I have simply test the performance impact on both x86 and aarch64. > > There is no degradation under x86 (2 sockets, 18 core per sockets, 2 threads > per core) Yeah, x86 is magical here, it's the same single instruction for both ;-) But it is, afaik, unique in this position, no other arch can pull that off. > However the performance degradation is huge under aarch64 (4 sockets, 24 core > per sockets): nearly 60% lost. > > v4.19.111 > no writer, reader cn | 24| 48| > 72| 96 > the rate of down_read/up_read per second | 166129572 | 166064100 | > 165963448 | 165203565 > the rate of down_read/up_read per second (patched) | 63863506 | 63842132 | > 63757267 | 63514920 Teh hurt :/
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Wed, Sep 16, 2020 at 08:52:07AM -0400, Qian Cai wrote: > On Tue, 2020-09-15 at 16:30 +0200, pet...@infradead.org wrote: > > On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote: > > > I think this happened because seqcount_##lockname##_init() is defined at > > > function rather than macro, so when the seqcount_init() gets expand in > > > > Bah! I hate all this :/ > > > > I suspect the below, while more verbose than I'd like is the best > > option. > > Stephen, can you add this patch for now until Peter beats you to it? Did you verify it works? I only wrote it..
Re: [PATCHv2] perf: Fix race in perf_mmap_close function
On Wed, Sep 16, 2020 at 01:53:11PM +0200, Jiri Olsa wrote: > There's a possible race in perf_mmap_close when checking ring buffer's > mmap_count refcount value. The problem is that the mmap_count check is > not atomic because we call atomic_dec and atomic_read separately. > > perf_mmap_close: > ... >atomic_dec(&rb->mmap_count); >... >if (atomic_read(&rb->mmap_count)) > goto out_put; > > >free_uid > > out_put: > ring_buffer_put(rb); /* could be last */ > > The race can happen when we have two (or more) events sharing same ring > buffer and they go through atomic_dec and then they both see 0 as refcount > value later in atomic_read. Then both will go on and execute code which > is meant to be run just once. > > The code that detaches ring buffer is probably fine to be executed more > than once, but the problem is in calling free_uid, which will later on > demonstrate in related crashes and refcount warnings, like: > > refcount_t: addition on 0; use-after-free. > ... > RIP: 0010:refcount_warn_saturate+0x6d/0xf > ... > Call Trace: > prepare_creds+0x190/0x1e0 > copy_creds+0x35/0x172 > copy_process+0x471/0x1a80 > _do_fork+0x83/0x3a0 > __do_sys_wait4+0x83/0x90 > __do_sys_clone+0x85/0xa0 > do_syscall_64+0x5b/0x1e0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Using atomic decrease and check instead of separated calls. > This fixes CVE-2020-14351. I'm tempted to remove that line; I can never seem to find anything useful in a CVE. Fixes: ? > Acked-by: Namhyung Kim > Tested-by: Michael Petlan > Signed-off-by: Jiri Olsa
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Wed, Sep 16, 2020 at 09:00:59AM -0400, Qian Cai wrote: > > > - Original Message - > > On Wed, Sep 16, 2020 at 08:52:07AM -0400, Qian Cai wrote: > > > On Tue, 2020-09-15 at 16:30 +0200, pet...@infradead.org wrote: > > > > On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote: > > > > > I think this happened because seqcount_##lockname##_init() is defined > > > > > at > > > > > function rather than macro, so when the seqcount_init() gets expand in > > > > > > > > Bah! I hate all this :/ > > > > > > > > I suspect the below, while more verbose than I'd like is the best > > > > option. > > > > > > Stephen, can you add this patch for now until Peter beats you to it? > > > > Did you verify it works? I only wrote it.. > > Yes, I did. Excellent, I'll stick a Tested-by from you on then.
Re: [tip:x86/seves] BUILD SUCCESS WITH WARNING e6eb15c9ba3165698488ae5c34920eea20eaa38e
On Wed, Sep 16, 2020 at 10:46:41AM +0200, Marco Elver wrote: > On Wed, 16 Sep 2020 at 10:30, wrote: > > On Tue, Sep 15, 2020 at 08:09:16PM +0200, Marco Elver wrote: > > > On Tue, 15 Sep 2020 at 19:40, Nick Desaulniers > > > wrote: > > > > On Tue, Sep 15, 2020 at 10:21 AM Borislav Petkov wrote: > > > > > > > init/calibrate.o: warning: objtool: asan.module_ctor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/calibrate.o: warning: objtool: asan.module_dtor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/version.o: warning: objtool: asan.module_ctor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/version.o: warning: objtool: asan.module_dtor()+0xc: call > > > > > without frame pointer save/setup > > > > > certs/system_keyring.o: warning: objtool: asan.module_ctor()+0xc: > > > > > call without frame pointer save/setup > > > > > certs/system_keyring.o: warning: objtool: asan.module_dtor()+0xc: > > > > > call without frame pointer save/setup > > > > > > This one also appears with Clang 11. This is new I think because we > > > started emitting ASAN ctors for globals redzone initialization. > > > > > > I think we really do not care about precise stack frames in these > > > compiler-generated functions. So, would it be reasonable to make > > > objtool ignore all *san.module_ctor and *san.module_dtor functions (we > > > have them for ASAN, TSAN, MSAN)? > > > > The thing is, if objtool cannot follow, it cannot generate ORC data and > > our unwinder cannot unwind through the instrumentation, and that is a > > fail. > > > > Or am I missing something here? > > They aren't about the actual instrumentation. The warnings are about > module_ctor/module_dtor functions which are compiler-generated, and > these are only called on initialization/destruction (dtors only for > modules I guess). > > E.g. for KASAN it's the calls to __asan_register_globals that are > called from asan.module_ctor. 
For KCSAN the tsan.module_ctor is > effectively a noop (because __tsan_init() is a noop), so it really > doesn't matter much. > > Is my assumption correct that the only effect would be if something > called by them fails, we just don't see the full stack trace? I think > we can live with that, there are only few central places that deal > with ctors/dtors (do_ctors(), ...?). Not only fails, lockdep for example likes to store stack traces of various callsites etc.. Also perf (NMI) likes to think it can unwind at all times. > The "real" fix would be to teach the compilers about "frame pointer > save/setup" for generated functions, but I don't think that's > realistic. How is that unrealistic? If you build with framepointers enabled, the compiler is supposed to know about this stuff.
Re: [tip:x86/seves] BUILD SUCCESS WITH WARNING e6eb15c9ba3165698488ae5c34920eea20eaa38e
On Tue, Sep 15, 2020 at 08:09:16PM +0200, Marco Elver wrote: > On Tue, 15 Sep 2020 at 19:40, Nick Desaulniers > wrote: > > On Tue, Sep 15, 2020 at 10:21 AM Borislav Petkov wrote: > > > init/calibrate.o: warning: objtool: asan.module_ctor()+0xc: call without > > > frame pointer save/setup > > > init/calibrate.o: warning: objtool: asan.module_dtor()+0xc: call without > > > frame pointer save/setup > > > init/version.o: warning: objtool: asan.module_ctor()+0xc: call without > > > frame pointer save/setup > > > init/version.o: warning: objtool: asan.module_dtor()+0xc: call without > > > frame pointer save/setup > > > certs/system_keyring.o: warning: objtool: asan.module_ctor()+0xc: call > > > without frame pointer save/setup > > > certs/system_keyring.o: warning: objtool: asan.module_dtor()+0xc: call > > > without frame pointer save/setup > > This one also appears with Clang 11. This is new I think because we > started emitting ASAN ctors for globals redzone initialization. > > I think we really do not care about precise stack frames in these > compiler-generated functions. So, would it be reasonable to make > objtool ignore all *san.module_ctor and *san.module_dtor functions (we > have them for ASAN, TSAN, MSAN)? The thing is, if objtool cannot follow, it cannot generate ORC data and our unwinder cannot unwind through the instrumentation, and that is a fail. Or am I missing something here?
Re: [PATCH] arch: x86: power: cpu: init %gs before __restore_processor_state (clang)
On Tue, Sep 15, 2020 at 12:51:47PM -0700, Nick Desaulniers wrote: > It would be much nicer if we had the flexibility to disable stack > protectors per function, rather than per translation unit. I'm going > to encourage you to encourage your favorite compile vendor ("write to > your senator") to support the function attribute > __attribute__((no_stack_protector)) so that one day, we can use that > to stop shipping crap like a9a3ed1eff360 ("x86: Fix early boot crash > on gcc-10, third try"). I think we were all in favour of having that, not sure why it hasn't happened yet. More important matters I suppose :/
Re: [PATCH] stats: Replace seq_printf with seq_puts
On Wed, Sep 16, 2020 at 03:44:15AM +, Xu Wang wrote: > seq_puts is a lot cheaper than seq_printf, so use that to print > literal strings. performance is not a consideration here. > Signed-off-by: Xu Wang > --- > kernel/sched/stats.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c > index 750fb3c67eed..0818fe03407a 100644 > --- a/kernel/sched/stats.c > +++ b/kernel/sched/stats.c > @@ -37,7 +37,7 @@ static int show_schedstat(struct seq_file *seq, void *v) > rq->rq_cpu_time, > rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount); > > - seq_printf(seq, "\n"); > + seq_putc(seq, '\n'); And yet, the changelog says seq_put*s*(). > > #ifdef CONFIG_SMP > /* domain-specific stats */ > -- > 2.17.1 >
Re: [RFC] autonuma: Migrate on fault among multiple bound nodes
On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote: > So in this patch, if MPOL_BIND is used to bind the memory of the > application to multiple nodes, and in the hint page fault handler both > the faulting page node and the accessing node are in the policy > nodemask, the page will be tried to be migrated to the accessing node > to reduce the cross-node accessing. Seems fair enough.. > Questions: > > Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA > optimizing globally. And now, it appears that the explicit NUMA > memory policy specifying (e.g. via numactl, mbind(), etc.) acts like > an implicit per-thread/VMA knob to enable/disable the AutoNUMA > optimizing for the thread/VMA. Although this looks like a side effect > instead of an API, from commit fc3147245d19 ("mm: numa: Limit NUMA > scanning to migrate-on-fault VMAs"), this is used by some users? So > the question is, do we need an explicit per-thread/VMA knob to > enable/disable AutoNUMA optimizing for the thread/VMA? Or just use > the global knob, either optimize all thread/VMAs as long as the > explicitly specified memory policies are respected, or don't optimize > at all. I don't understand the question; that commit is not about disabling numa balancing, it's about avoiding pointless work and overhead. What's the point of scanning memory if you're not going to be allowed to move it anyway. 
> Signed-off-by: "Huang, Ying" > Cc: Andrew Morton > Cc: Ingo Molnar > Cc: Mel Gorman > Cc: Rik van Riel > Cc: Johannes Weiner > Cc: "Matthew Wilcox (Oracle)" > Cc: Dave Hansen > Cc: Andi Kleen > Cc: Michal Hocko > Cc: David Rientjes > --- > mm/mempolicy.c | 43 +++ > 1 file changed, 31 insertions(+), 12 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index eddbe4e56c73..a941eab2de24 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1827,6 +1827,13 @@ static struct mempolicy *get_vma_policy(struct > vm_area_struct *vma, > return pol; > } > > +static bool mpol_may_mof(struct mempolicy *pol) > +{ > + /* May migrate among bound nodes for MPOL_BIND */ > + return pol->flags & MPOL_F_MOF || > + (pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1); > +} This is weird, why not just set F_MOF on the policy? In fact, why wouldn't something like: mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY); work today? Afaict MF_LAZY will unconditionally result in M_MOF. > @@ -2494,20 +2503,30 @@ int mpol_misplaced(struct page *page, struct > vm_area_struct *vma, unsigned long > break; > > case MPOL_BIND: > /* > + * Allows binding to multiple nodes. If both current and > + * accessing nodes are in policy nodemask, migrate to > + * accessing node to optimize page placement. Otherwise, > + * use current page if in policy nodemask or MPOL_F_MOF not > + * set, else select nearest allowed node, if any. If no > + * allowed nodes, use current [!misplaced]. >*/ > + if (node_isset(curnid, pol->v.nodes)) { > + if (node_isset(thisnid, pol->v.nodes)) { > + moron = true; > + polnid = thisnid; > + } else { > + goto out; > + } > + } else if (!(pol->flags & MPOL_F_MOF)) { > goto out; > + } else { > + z = first_zones_zonelist( > node_zonelist(numa_node_id(), GFP_HIGHUSER), > gfp_zone(GFP_HIGHUSER), > &pol->v.nodes); > + polnid = zone_to_nid(z->zone); > + } > break; > > default: Did that want to be this instead? I don't think I follow the other changes. 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eddbe4e56c73..2a64913f9ac6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2501,8 +2501,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long * else select nearest allowed node, if any. * If no allowed nodes, use current [!misplaced]. */ - if (node_isset(curnid, pol->v.nodes)) + if (node_isset(curnid, pol->v.nodes)) { + if (node_isset(thisnid, pol->v.nodes)) + goto moron; goto out; + } z = first_zones_zonelist( node_zonelist(numa_node_id(), GFP_HIGHUSER), gfp_zone(GFP_HIGHUSER), @@ -2516,6 +2519,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long /* Migrate the page towards the node whose CPU is referencing it */ if (pol->flags & MPOL_
Re: [PATCH] x86/smap: Fix the smap_save() asm
On Tue, Sep 15, 2020 at 01:55:58PM -0700, Andy Lutomirski wrote: > The old smap_save() code was: > > pushf > pop %0 > > with %0 defined by an "=rm" constraint. This is fine if the > compiler picked the register option, but it was incorrect with an > %rsp-relative memory operand. With some intentional abuse, I can > get both gcc and clang to generate code along these lines: > > pushfq > popq 0x8(%rsp) > mov 0x8(%rsp),%rax > > which is incorrect and will not work as intended. We need another constraint :-) > Fix it by removing the memory option. This issue is exacerbated by > a clang optimization bug: > > https://bugs.llvm.org/show_bug.cgi?id=47530 > > Fixes: e74deb11931f ("x86/uaccess: Introduce user_access_{save,restore}()") > Cc: sta...@vger.kernel.org > Reported-by: Bill Wendling # I think > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/smap.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h > index 8b58d6975d5d..be6d675ae3ac 100644 > --- a/arch/x86/include/asm/smap.h > +++ b/arch/x86/include/asm/smap.h > @@ -61,7 +61,7 @@ static __always_inline unsigned long smap_save(void) > ALTERNATIVE("jmp 1f", "", X86_FEATURE_SMAP) > "pushf; pop %0; " __ASM_CLAC "\n\t" > "1:" > - : "=rm" (flags) : : "memory", "cc"); > + : "=r" (flags) : : "memory", "cc"); native_save_fl() has the exact same code; you'll need to fix both.
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote:

> I think this happened because seqcount_##lockname##_init() is defined as
> a function rather than a macro, so when the seqcount_init() gets expanded in

Bah! I hate all this :/

I suspect the below, while more verbose than I'd like, is the best
option.

---
 include/linux/seqlock.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index f73c7eb68f27..76e44e6c0100 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -173,6 +173,19 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
  * @lock: Pointer to the associated lock
  */

+#define seqcount_LOCKNAME_init(s, _lock, lockname)			\
+	do {								\
+		seqcount_##lockname##_t *____s = (s);			\
+		seqcount_init(&____s->seqcount);			\
+		__SEQ_LOCK(____s->lock = (_lock));			\
+	} while (0)
+
+#define seqcount_raw_spinlock_init(s, lock)	seqcount_LOCKNAME_init(s, lock, raw_spinlock)
+#define seqcount_spinlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, spinlock)
+#define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock);
+#define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex);
+#define seqcount_ww_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, ww_mutex);
+
 /*
  * SEQCOUNT_LOCKNAME()	- Instantiate seqcount_LOCKNAME_t and helpers
  * seqprop_LOCKNAME_*()	- Property accessors for seqcount_LOCKNAME_t
@@ -190,13 +203,6 @@ typedef struct seqcount_##lockname {				\
 	__SEQ_LOCK(locktype	*lock);					\
 } seqcount_##lockname##_t;						\
 									\
-static __always_inline void						\
-seqcount_##lockname##_init(seqcount_##lockname##_t *s, locktype *lock)	\
-{									\
-	seqcount_init(&s->seqcount);					\
-	__SEQ_LOCK(s->lock = lock);					\
-}									\
-									\
 static __always_inline seqcount_t *					\
 __seqprop_##lockname##_ptr(seqcount_##lockname##_t *s)			\
 {									\
@@ -284,8 +290,8 @@ SEQCOUNT_LOCKNAME(ww_mutex,	struct ww_mutex, true,	&s->lock->base, ww_mu
 		__SEQ_LOCK(.lock	= (assoc_lock))			\
 	}

-#define SEQCNT_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_RAW_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
+#define SEQCNT_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_RWLOCK_ZERO(name, lock)		SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_MUTEX_ZERO(name, lock)		SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_WW_MUTEX_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:

> Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.

How's this?

---
Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
From: Hou Tao
Date: Tue, 15 Sep 2020 22:07:50 +0800

From: Hou Tao

The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
that percpu-rwsem is a blocking primitive, should be just fine.

However, file_end_write() is used from IRQ context and will cause
load-store issues.

Fix it by using the IRQ-safe this_cpu_*() operations on read_count.

This will generate more expensive code on a number of platforms, which
might cause a performance regression for some of the other percpu-rwsem
users. If any such is reported, we can consider alternative solutions.

Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
Signed-off-by: Hou Tao
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
---
 include/linux/percpu-rwsem.h  |    8 ++++----
 kernel/locking/percpu-rwsem.c |    4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -60,7 +60,7 @@ static inline void percpu_down_read(stru
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -79,7 +79,7 @@ static inline bool percpu_down_read_tryl
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -103,7 +103,7 @@ static inline void percpu_up_read(struct
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,7 +115,7 @@ static inline void percpu_up_read(struct
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	__this_cpu_inc(*sem->read_count);
+	this_cpu_inc(*sem->read_count);

 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(s
 	if (likely(!atomic_read_acquire(&sem->block)))
 		return true;

-	__this_cpu_dec(*sem->read_count);
+	this_cpu_dec(*sem->read_count);

 	/* Prod writer to re-evaluate readers_active_check() */
 	rcuwait_wake_up(&sem->writer);
Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
>
> This commit causes the kernel to oops and reboot when injecting an SLB
> multihit which causes an MCE.
>
> Before this commit an SLB multihit was corrected by the kernel and the
> system continued to operate normally.
>
> cc: sta...@vger.kernel.org
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> Signed-off-by: Michal Suchanek

Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()"),
nmi_enter() has supported nesting natively.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:11:23PM +0100, Will Deacon wrote:
> On Tue, Sep 15, 2020 at 06:03:44PM +0200, pet...@infradead.org wrote:
> > On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> >
> > > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> >
> > How's this?
> >
> > ---
> > Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> > From: Hou Tao
> > Date: Tue, 15 Sep 2020 22:07:50 +0800
> >
> > From: Hou Tao
> >
> > The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> > that percpu-rwsem is a blocking primitive, should be just fine.
> >
> > However, file_end_write() is used from IRQ context and will cause
> > load-store issues.
>
> ... on architectures where the per-cpu accessors are not atomic.

That's not entirely accurate. On x86, for example, the per-cpu ops are
not atomic, but they are not susceptible to this problem because they
are a single instruction from the point of view of interrupts -- either
they wholly happen or they don't.

So I'd reformulate it like:

  "... on architectures where the per-cpu accessors are not natively
  irq-safe"

?
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:31:14PM +0200, Oleg Nesterov wrote:
> > So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
> > IRQ context is totally out of spec. That said, we've (grudgingly)
> > accommodated them before.
>
> Yes, I didn't expect percpu_up_ can be called from IRQ :/

Yeah, me neither. That's well out of spec for a blocking primitive in
general.

> > This seems to be a fairly long standing issue, and certainly not unique
> > to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
> > should be similarly affected I think). The issue seems to stem from
> > Oleg's original rewrite:
> >
> >   a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers
> >   unnecessarily")
>
> Not really... I think it was 70fe2f48152e ("aio: fix freeze protection of aio
> writes").

Ah, that came later? Fair enough, I'll change the Fixes line.

> And iiuc io_uring does the same.

Indeed, I just went through a bunch of the file_end_write() callers.

> > and is certainly an understandable mistake.
> >
> > I'm torn on what to do, using this_cpu over __this_cpu is going to
> > adversely affect code-gen (and possibly performance) for all the
> > percpu-rwsem users that are not quite so 'creative'.
>
> Yes, but what else can we do?

Well, I just talked about it with Will; there's a bunch of things we
could do, but they're all quite ugly. My leading alternative was adding:

  percpu_down_read_irqsafe() / percpu_up_read_irqsafe()

which use local_irq_save() instead of preempt_disable(). But blergh..

Will also argued that by going with this patch, we'll get an affected
workload when someone reports a performance regression, which I suppose
is a bonus.

Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 10:07:50PM +0800, Hou Tao wrote:
> Under aarch64, __this_cpu_inc() is neither IRQ-safe nor atomic, so
> when percpu_up_read() is invoked under IRQ-context (e.g. aio completion),
> and it interrupts the process on the same CPU which is invoking
> percpu_down_read(), the decrement on read_count may be lost and
> the final value of read_count on the CPU will be unexpected
> as shown below:
>
> Fix it by using the IRQ-safe helpers this_cpu_inc|dec() for
> operations on read_count.
>
> Another plausible fix is to state that percpu-rwsem can NOT be
> used under IRQ context and convert all users which may
> use it under IRQ context.

*groan*...

So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
IRQ context is totally out of spec. That said, we've (grudgingly)
accommodated them before.

This seems to be a fairly long standing issue, and certainly not unique
to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
should be similarly affected I think). The issue seems to stem from
Oleg's original rewrite:

  a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers
  unnecessarily")

and is certainly an understandable mistake.

I'm torn on what to do; using this_cpu over __this_cpu is going to
adversely affect code-gen (and possibly performance) for all the
percpu-rwsem users that are not quite so 'creative'.
Re: WARNING: suspicious RCU usage: race/events/tlb.h:57 suspicious rcu_dereference_check() usage!
On Mon, Sep 14, 2020 at 01:29:34PM -0400, Qian Cai wrote: > On Wed, 2020-09-09 at 10:08 +0530, Naresh Kamboju wrote: > > While booting x86_64 with Linux next 20200908 tag kernel this warning > > was noticed. > > This pretty much looks like the same issue in: > > https://lore.kernel.org/lkml/20200902035146.ga45...@roeck-us.net/ > > Can you revert the patchset to see if it is related? Don't bother.. I'll be sending patches soon.
Re: Static call dependency on libelf version?
On Tue, Sep 15, 2020 at 12:50:54AM -0700, Hugh Dickins wrote:
> This is just an FYI written from a position of ignorance: I may
> have got it wrong, and my build environment too piecemeal to matter
> to anyone else; but what I saw was weird enough to be worth mentioning,
> in case it saves someone some time.
>
> I usually build and test on mmotm weekly rather than linux-next daily.
> No problem with 5.9-rc3-mm1 from 2020-09-04, nor with 5.9-rc5, but
> (on two machines) 5.9-rc5-mm1 from 2020-09-13 could not link vmlinux:
>
>   AR      init/built-in.a
>   LD      vmlinux.o
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: init/built-in.a: member init/main.o in archive is not an object
> make[1]: *** [vmlinux] Error 1
> make: *** [__sub-make] Error 2
>
> On the third machine, a more recent installation, but using the same
> gcc and the same binutils, I could build the same config successfully.
> init/main.o was the same size on each (49216 bytes), but diff of hd
> of the good against the bad showed:
>
> 2702,2709c2702,2709
> < 00bfc0 01db 0001 0003  ><
> < 00bfd0 b316  ><
> < 00bfe0 0018  ><
> < 00bff0 0001 0008  ><
> < 00c000 01ee 0004 0040  >@...<
> < 00c010 b330  >0...<
> < 00c020 0090 002d 0030  >-...0...<
> < 00c030 0008 0018  ><
> ---
> > 00bfc0 01f1  ><
> > 00bfd0 79732e00 6261746d 74732e00 62617472  >..symtab..strtab<
> > 00bfe0 68732e00 74727473 2e006261 616c6572  >..shstrtab..rela<
> > 00bff0 7865742e 722e0074 2e616c65 61746164  >.text..rela.data<
> > 00c000 73622e00 722e0073 5f616c65 6172745f  >..bss..rela__tra<
> > 00c010 6f706563 73746e69 7274705f 722e0073  >cepoints_ptrs..r<
> > 00c020 2e616c65 74617473 635f6369 2e6c6c61  >ela.static_call.<
> > 00c030 74786574 65722e00 692e616c 2e74696e  >text..rela.init.<
>
> and 217 other .os in the build tree also "corrupted".
>
> CONFIG_HAVE_STATIC_CALL=y
> CONFIG_HAVE_STATIC_CALL_INLINE=y
>
> stand out as new in the .config for 5.9-rc5-mm1, and references
> to objtool in static_call.h and static_call_types.h took me to
> tools/objtool/Makefile, with its use of libelf.
>
> I've copied over files of the newer libelf (0.168) to the failing
> machines, which are now building the 5.9-rc5-mm1 vmlinux correctly.
>
> It looks as if CONFIG_HAVE_STATIC_CALL=y depends on a newer libelf
> than I had before (0.155), and should either insist on a minimum
> version, or else be adjusted to work with older versions.

Hurmph, I have no idea how this happened; clearly none of my machines
have this older libelf :/ (the machines I use most seem to be on 0.180.)

I'm also not sure what static_call is doing differently from, say, ORC
data generation. Both create and fill sections in similar ways.

Mark, do you have any idea?
Re: [PATCH] x86/unwind/fp: Fix FP unwinding in ret_from_fork
On Mon, Sep 14, 2020 at 12:04:22PM -0500, Josh Poimboeuf wrote: > There have been some reports of "bad bp value" warnings printed by the > frame pointer unwinder: > > WARNING: kernel stack regs at 5bac7112 in sh:1014 has bad 'bp' > value > > This warning happens when unwinding from an interrupt in > ret_from_fork(). If entry code gets interrupted, the state of the frame > pointer (rbp) may be undefined, which can confuse the unwinder, > resulting in warnings like the above. > > There's an in_entry_code() check which normally silences such warnings > for entry code. But in this case, ret_from_fork() is getting > interrupted. It recently got moved out of .entry.text, so the > in_entry_code() check no longer works. > > We could move it back into .entry.text, but that would break the noinstr > validation because of the call to schedule_tail(). > > Instead, initialize each new task's RBP to point to the task's entry > regs via an encoded frame pointer. That will allow the unwinder to > reach the end of the stack gracefully. > > Fixes: b9f6976bfb94 ("x86/entry/64: Move non entry code into .text section") > Reported-by: Naresh Kamboju > Reported-by: Logan Gunthorpe > Signed-off-by: Josh Poimboeuf Acked-by: Peter Zijlstra (Intel)
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Mon, Sep 14, 2020 at 12:28:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > >   struct {
> > >           struct perf_event_header header;
> > >
> > >           u32     pid, tid;
> > >           u64     addr;
> > >           u64     len;
> > >           u64     pgoff;
> > >           u32     maj;
> > >           u32     min;
> > >           u64     ino;
> > >           u64     ino_generation;
> > >           u32     prot, flags;
> > >           u32     reserved;
>
> What for this reserved? It's all nicely aligned already, u64 followed by
> two u32 (prot, flags).

I suspect it is so that sizeof(reserved + buildid) is a multiple of 8.
But yes, that's a wee bit daft, since the next field is a variable
length character array.

> > >           u8      buildid[20];
>
> > > Do we need maj, min, ino, ino_generation for mmap3 event?
> > > I think they are to compare binaries, then we can do it with
> > > build-id (and I think it'd be better)..
>
> Humm, I thought MMAP2 would be a superset of MMAP and MMAP3 would be a
> superset of MMAP2.

Well, the 'funny' thing is that if you do use buildid, then
{maj,min,ino,ino_generation} are indeed superfluous, but combined they
are also large enough to contain a buildid.

> If we want to ditch useless stuff, then throw away pid, tid too, as we
> can select those via sample_type.

Correct. So something like:

	struct {
		struct perf_event_header header;

		u64	addr;
		u64	len;
		u64	pgoff;
		union {
			struct {
				u32	maj;
				u32	min;
				u64	ino;
				u64	ino_generation;
			};
			u8	buildid[20];
		};
		u32	prot, flags;
		char	filename[];
		struct sample_id sample_id;
	};

Using one of the MISC bits to resolve the union.

Might actually bring benefit to everyone. Us normal people get to have a
smaller MMAP record, while the buildid folks can have it too.

Even more extreme would be using 2 MISC bits and allowing the union to
be 0 sized for anon.

That said; I have the nagging feeling there were unresolved issues with
mmap2, but I can't seem to find any relevant emails on it :/ My
google-fu is weak today.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Mon, Sep 14, 2020 at 12:27:35PM +0100, Qais Yousef wrote:
> What does PREEMPT_RT do to deal with softirq delays?

Makes the lot preemptible; you found the patch below.

> I have tried playing with enabling threadirqs, which AFAIU should make
> softirqs preemptible, right?

Not yet,..

> I realize this patch is still missing from mainline at least:
>
>   https://gitlab.com/kalilinux/packages/linux/blob/a17bad0db9da44cd73f594794a58cc5646393b13/debian/patches-rt/softirq-Add-preemptible-softirq.patch
>
> Would this be a heavy handed approach to make available for non PREEMPT_RT
> kernels?

Not sure, I suspect it relies on migrate_disable(), which is
preempt_disable() on !RT, and then we're back to square one.

> I only worry about potential NET_RX throughput issues. Which by the way is
> protected with preempt_disable currently in mainline. See netif_rx_ni().

So preempt_disable() isn't necessarily a problem, you just want it to
terminate soonish after need_resched() becomes true.

Also, I'm having a wee problem getting from net_rx_action() to
netif_rx_ni().

> I am guessing here, but I suspect this NET_RX softirq is one source of big
> delays when network activity is high.

Well, one approach is to more aggressively limit how long softirq
processing can run. Current measures are very soft in that regard.
Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote: > Vincent Guittot (4): > sched/fair: relax constraint on task's load during load balance > sched/fair: reduce minimal imbalance threshold > sched/fair: minimize concurrent LBs between domain level > sched/fair: reduce busy load balance interval I see nothing objectionable there, a little more testing can't hurt, but I'm tempted to apply them. Phil, Mel, any chance you can run them through your respective setups?
Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
> Reading and modifying current->mm and current->active_mm and switching
> mm should be done with irqs off, to prevent races seeing an intermediate
> state.
>
> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
> invalidate"). At exec-time when the new mm is activated, the old one
> should usually be single-threaded and no longer used, unless something
> else is holding an mm_users reference (which may be possible).
>
> Absent other mm_users, there is also a race with preemption and lazy tlb
> switching. Consider the kernel_execve case where the current thread is
> using a lazy tlb active mm:
>
>   call_usermodehelper()
>     kernel_execve()
>       old_mm = current->mm;
>       active_mm = current->active_mm;
>       *** preempt *** -------------------->  schedule()
>                                                prev->active_mm = NULL;
>                                                mmdrop(prev active_mm);
>                                              ...
>       <<< resumed                         <  schedule()
>       current->mm = mm;
>       current->active_mm = mm;
>       if (!old_mm)
>           mmdrop(active_mm);
>
> If we switch back to the kernel thread from a different mm, there is a
> double free of the old active_mm, and a missing free of the new one.
>
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
>
> So as a first step, disable interrupts across the mm/active_mm updates
> to close the lazy tlb preempt race, and provide an arch option to
> extend that to activate_mm which allows architectures doing IPI based
> TLB shootdowns to close the second race.
>
> This is a bit ugly, but in the interest of fixing the bug and backporting
> before all architectures are converted this is a compromise.
>
> Signed-off-by: Nicholas Piggin

Acked-by: Peter Zijlstra (Intel)

I'm thinking we want this selected on x86 as well. Andy?

> ---
>  arch/Kconfig | 7 +++++++
>  fs/exec.c    | 17 +++++++++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..94821e3f94d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
>  	bool
>  	depends on MMU_GATHER_TABLE_FREE
>
> +config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> +	bool
> +	help
> +	  Temporary select until all architectures can be converted to have
> +	  irqs disabled over activate_mm. Architectures that do IPI based TLB
> +	  shootdowns should enable this.
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a91003e28eaa..d4fb18baf1fb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
>  	}
>
>  	task_lock(tsk);
> -	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> -	tsk->mm = mm;
> +
> +	local_irq_disable();
> +	active_mm = tsk->active_mm;
>  	tsk->active_mm = mm;
> +	tsk->mm = mm;
> +	/*
> +	 * This prevents preemption while active_mm is being loaded and
> +	 * it and mm are being updated, which could cause problems for
> +	 * lazy tlb mm refcounting when these are updated by context
> +	 * switches. Not all architectures can handle irqs off over
> +	 * activate_mm yet.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	activate_mm(active_mm, mm);
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	tsk->mm->vmacache_seqnum = 0;
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
> --
> 2.23.0
>
Re: [PATCH 1/6] x86: events: Avoid TIF_IA32 when checking 64bit mode
On Sat, Sep 12, 2020 at 03:05:48AM -0400, Gabriel Krisman Bertazi wrote: > In preparation to remove TIF_IA32, stop using it in perf events code. > > Tested by running perf on 32-bit, 64-bit and x32 applications. > > Suggested-by: Andy Lutomirski > Signed-off-by: Gabriel Krisman Bertazi Acked-by: Peter Zijlstra (Intel)
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Sun, Sep 13, 2020 at 11:02:49PM +0200, Jiri Olsa wrote:
> Add new version of mmap event. The MMAP3 record is an
> augmented version of MMAP2, it adds build id value to
> identify the exact binary object behind memory map:
>
>   struct {
>           struct perf_event_header header;
>
>           u32     pid, tid;
>           u64     addr;
>           u64     len;
>           u64     pgoff;
>           u32     maj;
>           u32     min;
>           u64     ino;
>           u64     ino_generation;
>           u32     prot, flags;
>           u32     reserved;
>           u8      buildid[20];
>           char    filename[];
>           struct sample_id sample_id;
>   };

So weren't there still open problems with mmap2 that also needed
addressing? I seem to have forgotten :/
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Sun, Sep 13, 2020 at 11:41:00PM -0700, Stephane Eranian wrote:
> On Sun, Sep 13, 2020 at 2:03 PM Jiri Olsa wrote:
>
> What happens if I set mmap3 and mmap2?
>
> I think using mmap3 for every mmap may be overkill, as you add a useless
> 20 bytes to an mmap record.
>
> I am not sure if your code handles the case where mmap3 is not needed
> because there is no buildid, e.g., anonymous memory.
>
> It seems to me you've written the patch in such a way that if the user
> tool supports mmap3, then it supersedes mmap2, and thus you need all the
> fields of mmap2. But it could be more interesting to return either MMAP2
> or MMAP3 depending on tool support and type of mmap; that would
> certainly save 20 bytes on any anon mmap. But maybe that logic is
> already in your patch and I missed it.

That, and what if you don't want any of that buildid nonsense at all? I
always kill that because it makes perf pointlessly slow and has
absolutely no upsides for me.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Fri, Sep 11, 2020 at 05:46:45PM +0100, Qais Yousef wrote:
> On 09/09/20 17:09, qianjun.ker...@gmail.com wrote:
> > From: jun qian
> >
> > When get the pending softirqs, it need to process all the pending
> > softirqs in the while loop. If the processing time of each pending
> > softirq is need more than 2 msec in this loop, or one of the softirq
> > will running a long time, according to the original code logic, it
> > will process all the pending softirqs without wakeuping ksoftirqd,
> > which will cause a relatively large scheduling delay on the
> > corresponding CPU, which we do not wish to see. The patch will check
> > the total time to process pending softirq, if the time exceeds 2 ms
> > we need to wakeup the ksofirqd to aviod large sched delay.
> >
> > Signed-off-by: jun qian
>
> In Android there's a patch that tries to avoid scheduling an RT task on
> a cpu that is running softirqs. I wonder if this patch helps with this
> case.
>
> https://android.googlesource.com/kernel/msm/+/5c3f54c34acf4d9ed01530288d4a98acff815d79%5E%21/#F0
>
> John, Wei, is this something of interest to you?

Urgh.. that's pretty gross. I think the sane approach is indeed getting
softirqs to react to need_resched() better.
Re: [PATCH 1/2] sched: Fix balance_callback()
On Fri, Sep 11, 2020 at 01:17:02PM +0100, Valentin Schneider wrote:
> On 11/09/20 09:17, Peter Zijlstra wrote:
> > The intent of balance_callback() has always been to delay executing
> > balancing operations until the end of the current rq->lock section.
> > This is because balance operations must often drop rq->lock, and that
> > isn't safe in general.
> >
> > However, as noted by Scott, there were a few holes in that scheme;
> > balance_callback() was called after rq->lock was dropped, which means
> > another CPU can interleave and touch the callback list.
>
> So that can be, say, a __schedule() tail racing with some setprio;
> what's the worst that can (currently) happen here? Something like, say,
> two consecutive enqueuings of push_rt_tasks() to the callback list?

Yeah, but that isn't in fact the case I worry most about. What can
happen (and what I've spotted once before) is that someone attempts to
enqueue a balance_callback from a rq->lock region that doesn't handle
the calls.

Currently that 'works'; that is, it will get run _eventually_. But
ideally we'd want that to not work and issue a WARN. We want the
callbacks to be timely.

So basically all of these machinations are in order to add the WARN :-)
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Fri, Sep 11, 2020 at 01:17:45PM +0100, Valentin Schneider wrote:
> > @@ -6968,6 +7064,8 @@ int sched_cpu_deactivate(unsigned int cp
> >	 */
> >	synchronize_rcu();
> >
> > +	balance_push_set(cpu, true);
> > +
>
> IIUC this is going to make every subsequent finish_lock_switch()
> migrate the switched-to task if it isn't a pcpu kthread. So this is
> going to lead to a dance of
>
>   switch_to() -> switch_to() -> switch_to() -> switch_to() ...
>
> Wouldn't it be better to batch all those in a migrate_tasks() sibling
> that skips pcpu kthreads?

That's 'difficult'; this is hotplug, performance is not a
consideration. Basically we don't have an iterator for the runqueues,
so finding these tasks is hard.
Re: [PATCH] ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
On Fri, Sep 11, 2020 at 03:55:22PM +0300, Adrian Hunter wrote:
> On 11/09/20 2:41 pm, pet...@infradead.org wrote:
> > On Tue, Sep 01, 2020 at 12:16:17PM +0300, Adrian Hunter wrote:
> >> Add synchronize_rcu() after list_del_rcu() in
> >> ftrace_remove_trampoline_from_kallsyms() to protect readers of
> >> ftrace_ops_trampoline_list (in ftrace_get_trampoline_kallsym)
> >> which is used when kallsyms is read.
> >>
> >> Fixes: fc0ea795f53c8d ("ftrace: Add symbols for ftrace trampolines")
> >> Signed-off-by: Adrian Hunter
> >> ---
> >>  kernel/trace/ftrace.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> >> index 275441254bb5..4e64367c9774 100644
> >> --- a/kernel/trace/ftrace.c
> >> +++ b/kernel/trace/ftrace.c
> >> @@ -2782,6 +2782,7 @@ static void ftrace_remove_trampoline_from_kallsyms(struct ftrace_ops *ops)
> >>  {
> >>  	lockdep_assert_held(&ftrace_lock);
> >>  	list_del_rcu(&ops->list);
> >> +	synchronize_rcu();
> >>  }
> >
> > Hurmph, we've just done a ton of that:
> >
> >   ftrace_shutdown()
> >     synchronize_rcu_tasks_rude()
> >     ftrace_trampoline_free()
> >       ftrace_remove_trampoline_from_kallsyms()
> >
> > So would it not be better to move that call before the existing
> > synchronize_rcu_tasks stuff rather than adding another synchronize_rcu()
> > call?
>
> Doesn't that mean removing the symbol while the trampoline is potentially
> still in use?

Hurm.. potentially, yes. OK, let's do this first.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Wed, Sep 09, 2020 at 05:09:31PM +0800, qianjun.ker...@gmail.com wrote:
> From: jun qian
>
> When get the pending softirqs, it need to process all the pending
> softirqs in the while loop. If the processing time of each pending
> softirq is need more than 2 msec in this loop, or one of the softirq
> will running a long time, according to the original code logic, it
> will process all the pending softirqs without wakeuping ksoftirqd,
> which will cause a relatively large scheduling delay on the
> corresponding CPU, which we do not wish to see. The patch will check
> the total time to process pending softirq, if the time exceeds 2 ms
> we need to wakeup the ksofirqd to aviod large sched delay.

But what is all that unreadable gibberish with pending_new_{flag,bit} ?

Random comments below..

> +#define MAX_SOFTIRQ_TIME_NS 2000000

2*NSEC_PER_MSEC

> +DEFINE_PER_CPU(__u32, pending_new_flag);
> +DEFINE_PER_CPU(__u32, pending_next_bit);

__u32 is for userspace ABI, this is not it, use u32

> +#define SOFTIRQ_PENDING_MASK ((1UL << NR_SOFTIRQS) - 1)
> +
>  asmlinkage __visible void __softirq_entry __do_softirq(void)
>  {
> -	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> +	u64 end = sched_clock() + MAX_SOFTIRQ_TIME_NS;
>  	unsigned long old_flags = current->flags;
>  	int max_restart = MAX_SOFTIRQ_RESTART;
>  	struct softirq_action *h;
>  	bool in_hardirq;
> -	__u32 pending;
> -	int softirq_bit;
> +	__u32 pending, pending_left, pending_new;
> +	int softirq_bit, next_bit;
> +	unsigned long flags;
>
>  	/*
>  	 * Mask out PF_MEMALLOC as the current task context is borrowed for the
> @@ -277,10 +282,33 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
>  	h = softirq_vec;
>
> -	while ((softirq_bit = ffs(pending))) {
> -		unsigned int vec_nr;
> +	next_bit = per_cpu(pending_next_bit, smp_processor_id());
> +	per_cpu(pending_new_flag, smp_processor_id()) = 0;

__this_cpu_read() / __this_cpu_write()

> +
> +	pending_left = pending &
> +		(SOFTIRQ_PENDING_MASK << next_bit);
> +	pending_new = pending &
> +		(SOFTIRQ_PENDING_MASK >> (NR_SOFTIRQS - next_bit));

The second mask is the inverse of the first.

> +	/*
> +	 * In order to be fair, we shold process the pengding bits by the
> +	 * last processing order.
> +	 */
> +	while ((softirq_bit = ffs(pending_left)) ||
> +		(softirq_bit = ffs(pending_new))) {
>  		int prev_count;
> +		unsigned int vec_nr = 0;
>
> +		/*
> +		 * when the left pengding bits have been handled, we should
> +		 * to reset the h to softirq_vec.
> +		 */
> +		if (!ffs(pending_left)) {
> +			if (per_cpu(pending_new_flag, smp_processor_id()) == 0) {
> +				h = softirq_vec;
> +				per_cpu(pending_new_flag, smp_processor_id()) = 1;
> +			}
> +		}
>  		h += softirq_bit - 1;
>
>  		vec_nr = h - softirq_vec;
> @@ -298,17 +326,44 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>  			preempt_count_set(prev_count);
>  		}
>  		h++;
> -		pending >>= softirq_bit;
> +
> +		if (ffs(pending_left))

This is the _third_ ffs(pending_left), those things are _expensive_ (on
some archs, see include/asm-generic/bitops/__ffs.h).

> +			pending_left >>= softirq_bit;
> +		else
> +			pending_new >>= softirq_bit;
> +
> +		/*
> +		 * the softirq's action has been run too much time,
> +		 * so it may need to wakeup the ksoftirqd
> +		 */
> +		if (need_resched() && sched_clock() > end) {
> +			/*
> +			 * Ensure that the remaining pending bits will be
> +			 * handled.
> +			 */
> +			local_irq_save(flags);
> +			if (ffs(pending_left))

*fourth*...

> +				or_softirq_pending((pending_left << (vec_nr + 1)) |
> +					pending_new);
> +			else
> +				or_softirq_pending(pending_new << (vec_nr + 1));
> +			local_irq_restore(flags);
> +			per_cpu(pending_next_bit, smp_processor_id()) = vec_nr + 1;
> +			break;
> +		}
>  	}
>
> +	/* reset the pending_next_bit */
> +	per_cpu(pending_next_bit, smp_processor_id()) = 0;
> +
>  	if (__this_cpu_read(ksoftirqd) == current)
>  		rcu_softirq_qs();
>  	local_irq_disable();
>
>  	pending = local_softirq_pending();
>  	if (pending) {
> -		if (time_before(jiffies, end) && !need_resched() &&
Re: [PATCH] ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
On Tue, Sep 01, 2020 at 12:16:17PM +0300, Adrian Hunter wrote: > Add synchronize_rcu() after list_del_rcu() in > ftrace_remove_trampoline_from_kallsyms() to protect readers of > ftrace_ops_trampoline_list (in ftrace_get_trampoline_kallsym) > which is used when kallsyms is read. > > Fixes: fc0ea795f53c8d ("ftrace: Add symbols for ftrace trampolines") > Signed-off-by: Adrian Hunter > --- > kernel/trace/ftrace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 275441254bb5..4e64367c9774 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -2782,6 +2782,7 @@ static void > ftrace_remove_trampoline_from_kallsyms(struct ftrace_ops *ops) > { > lockdep_assert_held(&ftrace_lock); > list_del_rcu(&ops->list); > + synchronize_rcu(); > } Hurmph, we've just done a ton of that: ftrace_shutdown() synchronize_rcu_tasks_rude() ftrace_trampoline_free() ftrace_remove_trampoline_from_kallsyms() So would it not be better to move that call before the existing synchronize_rcu_tasks stuff rather than adding another synchronize_rcu() call?
Re: [PATCH 1/4] cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
On Wed, Sep 02, 2020 at 12:54:41PM +0530, Viresh Kumar wrote: > + atomic_t reset_pending; > + atomic_set(&stats->reset_pending, 0); > + if (atomic_read(&stats->reset_pending)) > + bool pending = atomic_read(&stats->reset_pending); > + atomic_set(&stats->reset_pending, 1); > + bool pending = atomic_read(&stats->reset_pending); > + if (atomic_read(&stats->reset_pending)) What do you think atomic_t is doing for you?
Re: [PATCH v6 6/9] kernel: entry: Support Syscall User Dispatch for common syscall entry
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > Syscall User Dispatch (SUD) must take precedence over seccomp, since the > use case is emulation (it can be invoked with a different ABI) such that > seccomp filtering by syscall number doesn't make sense in the first > place. In addition, either the syscall is dispatched back to userspace, > in which case there is no resource for seccomp to protect, or the > syscall will be executed, and seccomp will execute next. > > Regarding ptrace, I experimented with before and after, and while the > same ABI argument applies, I felt it was easier to debug if I let ptrace > happen for syscalls that are dispatched back to userspace. In addition, > doing it after ptrace makes the code in syscall_exit_work slightly > simpler, since it doesn't require special handling for this feature. I think I'm with Andy that this should be before ptrace(). ptrace() users will attempt to interpret things like they're regular syscalls, and that's definitely not the case.
Re: [PATCH v6 5/9] kernel: Implement selective syscall userspace redirection
On Fri, Sep 04, 2020 at 04:31:43PM -0400, Gabriel Krisman Bertazi wrote: > +struct syscall_user_dispatch { > + char __user *selector; > + unsigned long dispatcher_start; > + unsigned long dispatcher_end; > +}; > +int do_syscall_user_dispatch(struct pt_regs *regs) > +{ > + struct syscall_user_dispatch *sd = ¤t->syscall_dispatch; > + unsigned long ip = instruction_pointer(regs); > + char state; > + > + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) > + return 0; If you use {offset,size}, instead of {start,end}, you can write the above like: if (ip - sd->dispatcher_offset < sd->dispatcher_size) return 0; which is just a single branch.
Re: [PATCH v6 2/9] kernel: entry: Support TIF_SYSCALL_INTERCEPT on common entry code
On Fri, Sep 04, 2020 at 04:31:40PM -0400, Gabriel Krisman Bertazi wrote: > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h > index efebbffcd5cc..72ce9ca860c6 100644 > --- a/include/linux/entry-common.h > +++ b/include/linux/entry-common.h > @@ -21,10 +21,6 @@ > # define _TIF_SYSCALL_TRACEPOINT (0) > #endif > > -#ifndef _TIF_SECCOMP > -# define _TIF_SECCOMP(0) > -#endif > - > #ifndef _TIF_SYSCALL_AUDIT > # define _TIF_SYSCALL_AUDIT (0) > #endif Why doesn't this add: #ifndef _TIF_SYSCALL_INTERCEPT #define _TIF_SYSCALL_INTERCEPT (0) #endif ?
Re: [PATCH v6 1/9] kernel: Support TIF_SYSCALL_INTERCEPT flag
On Fri, Sep 04, 2020 at 04:31:39PM -0400, Gabriel Krisman Bertazi wrote: > +static inline void __set_tsk_syscall_intercept(struct task_struct *tsk, > +unsigned int type) > +{ > + tsk->syscall_intercept |= type; > + > + if (tsk->syscall_intercept) > + set_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT); > +} Did the above want to be: unsigned int old = tsk->syscall_intercept; tsk->syscall_intercept |= type; if (!old) set_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT) ? > +static inline void __clear_tsk_syscall_intercept(struct task_struct *tsk, > + unsigned int type) > +{ > + tsk->syscall_intercept &= ~type; > + > + if (tsk->syscall_intercept == 0) > + clear_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT); > +}
Re: [PATCH 1/1] kernel/sched:use the enum code replace of the int variable
On Mon, Sep 07, 2020 at 09:05:02PM +0800, qianjun.ker...@gmail.com wrote: > From: jun qian > > It is hard to understand what the meaning of the value from > the return value of wakeup_preempt_entity, so I fix it. > @@ -6822,9 +6828,9 @@ static unsigned long wakeup_gran(struct sched_entity > *se) > * g > * |<--->|c > * > - * w(c, s1) = -1 > - * w(c, s2) = 0 > - * w(c, s3) = 1 > + * w(c, s1) = NO_NEED_PREEMPT > + * w(c, s2) = MAY_NEED_PREEMPT > + * w(c, s3) = NEED_PREEMPT > * > */ Yeah, I don't think so. The function is a simple C style compare, where negative is less, 0 is equal and positive is more than.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > So, I suggest pXX_offset_unlocked() Urgh, no. Elsewhere in gup _unlocked() means it will take the lock itself (get_user_pages_unlocked()) -- although often it seems to mean the lock is already held (git grep _unlocked and marvel). What we want is _lockless().
kcompactd hotplug fail
Hi, While playing with hotplug, I ran into the below: [ 2305.676384] [ cut here ] [ 2305.681543] WARNING: CPU: 1 PID: 15 at kernel/sched/core.c:1924 __set_cpus_allowed_ptr+0x1bd/0x230 [ 2305.691540] Modules linked in: kvm_intel kvm irqbypass rapl intel_cstate intel_uncore [ 2305.700284] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: GW 5.9.0-rc1-00126-g560d2f906d7e-dirty #392 [ 2305.711349] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013 [ 2305.722803] RIP: 0010:__set_cpus_allowed_ptr+0x1bd/0x230 [ 2305.728732] Code: ba 00 02 00 00 48 c7 c6 20 78 9e 82 4c 89 ef e8 19 ec 5f 00 85 c0 0f 85 5e ff ff ff 83 bb 60 03 00 00 01 0f 84 51 ff ff ff 90 <0f> 0b 90 e9 48 ff ff ff 83 bd 10 0a 00 00 02 48 89 5c 24 10 44 89 [ 2305.749687] RSP: :c900033dbdd8 EFLAGS: 00010002 [ 2305.755518] RAX: RBX: 88842c33cbc0 RCX: 0200 [ 2305.763478] RDX: 0008 RSI: 829e7820 RDI: 83055720 [ 2305.771439] RBP: 88842f43b4c0 R08: 0009 R09: 83055720 [ 2305.779399] R10: 0008 R11: R12: ffea [ 2305.787360] R13: 83055720 R14: 000d R15: 00b6 [ 2305.795321] FS: () GS:88842f48() knlGS: [ 2305.804348] CS: 0010 DS: ES: CR0: 80050033 [ 2305.810760] CR2: CR3: 02810001 CR4: 001706e0 [ 2305.818720] Call Trace: [ 2305.821454] kcompactd_cpu_online+0xa1/0xb0 [ 2305.826119] ? __compaction_suitable+0xa0/0xa0 [ 2305.831079] cpuhp_invoke_callback+0x9a/0x360 [ 2305.835941] cpuhp_thread_fun+0x19d/0x220 [ 2305.840414] ? smpboot_thread_fn+0x1b4/0x280 [ 2305.845178] ? smpboot_thread_fn+0x26/0x280 [ 2305.849842] ? smpboot_register_percpu_thread+0xe0/0xe0 [ 2305.855672] smpboot_thread_fn+0x1d0/0x280 [ 2305.860243] kthread+0x153/0x170 [ 2305.863843] ? 
kthread_create_worker_on_cpu+0x70/0x70 [ 2305.869482] ret_from_fork+0x22/0x30 [ 2305.873474] irq event stamp: 236 [ 2305.877067] hardirqs last enabled at (235): [] _raw_spin_unlock_irqrestore+0x4c/0x60 [ 2305.887550] hardirqs last disabled at (236): [] __schedule+0xc0/0xb10 [ 2305.896482] softirqs last enabled at (0): [] copy_process+0x889/0x1d40 [ 2305.905611] softirqs last disabled at (0): [<>] 0x0 [ 2305.912602] ---[ end trace e7f6c2a95b741e6b ]--- Given: static int __init kcompactd_init(void) { ... ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "mm/compaction:online", kcompactd_cpu_online, NULL); and: CPUHP_AP_ONLINE_DYN, CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30, CPUHP_AP_X86_HPET_ONLINE, CPUHP_AP_X86_KVM_CLK_ONLINE, CPUHP_AP_ACTIVE, this is somewhat expected behaviour. It tries and set the compaction affinity to include the newly onlined CPU before it is marked active and that's a no-no. Ideally the kcompactd notifier is ran after AP_ACTIVE, not before.
Re: kcompactd hotplug fail
On Thu, Sep 10, 2020 at 04:10:06PM +0200, pet...@infradead.org wrote: > Hi, > > While playing with hotplug, I ran into the below: Ah, it could be I wrecked my kernel bad... :-( I'll let you know if I can reproduce this on a pristine kernel.
Re: kcompactd hotplug fail
On Thu, Sep 10, 2020 at 04:37:45PM +0200, pet...@infradead.org wrote: > On Thu, Sep 10, 2020 at 04:10:06PM +0200, pet...@infradead.org wrote: > > Hi, > > > > While playing with hotplug, I ran into the below: > > Ah, it could be I wrecked my kernel bad... :-( > > I'll let you know if I can reproduce this on a pristine kernel. Yeah, sorry, I'll go search for one of them brown-paper things to wear.
Re: [sparc64] kernel OOPS bisected from "lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state"
On Thu, Sep 10, 2020 at 02:43:13PM +0300, Anatoly Pugachev wrote: > Hello! > > The following git patch 044d0d6de9f50192f9697583504a382347ee95ca > (linux git master branch) introduced the following kernel OOPS upon > kernel boot on my sparc64 T5-2 ldom (VM): https://lkml.kernel.org/r/20200908154157.gv1362...@hirez.programming.kicks-ass.net
Re: [PATCH v2 3/7] arch/x86/amd/ibs: Fix re-arming IBS Fetch
On Thu, Sep 10, 2020 at 10:32:23AM +0200, pet...@infradead.org wrote: > > @@ -363,7 +363,14 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, > > struct perf_event *event, > > static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, > > struct hw_perf_event *hwc, u64 config) > > { > > - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); > > + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; > > + > > + /* On Fam17h, the periodic fetch counter is set when IbsFetchEn is > > changed from 0 to 1 */ > > + if (perf_ibs == &perf_ibs_fetch && boot_cpu_data.x86 >= 0x16 && > > boot_cpu_data.x86 <= 0x18) > > + wrmsrl(hwc->config_base, _config); > A better option would be to use hwc->flags, you're loading from that > line already, so it's guaranteed hot and then you only have a single > branch. Or stick it in perf_ibs near enable_mask, same difference. I fixed it for you. --- struct perf_ibs { struct pmu pmu; /* 0 296 */ /* --- cacheline 4 boundary (256 bytes) was 40 bytes ago --- */ unsigned int msr; /* 296 4 */ /* XXX 4 bytes hole, try to pack */ u64config_mask; /* 304 8 */ u64cnt_mask; /* 312 8 */ /* --- cacheline 5 boundary (320 bytes) --- */ u64enable_mask; /* 320 8 */ u64valid_mask; /* 328 8 */ u64max_period; /* 336 8 */ long unsigned int offset_mask[1]; /* 344 8 */ intoffset_max; /* 352 4 */ unsigned int fetch_count_reset_broken:1; /* 356: 0 4 */ /* XXX 31 bits hole, try to pack */ struct cpu_perf_ibs * pcpu; /* 360 8 */ struct attribute * * format_attrs; /* 368 8 */ struct attribute_group format_group; /* 37640 */ /* --- cacheline 6 boundary (384 bytes) was 32 bytes ago --- */ const struct attribute_group * attr_groups[2]; /* 41616 */ u64(*get_count)(u64);/* 432 8 */ /* size: 440, cachelines: 7, members: 15 */ /* sum members: 432, holes: 1, sum holes: 4 */ /* sum bitfield members: 1 bits, bit holes: 1, sum bit holes: 31 bits */ /* last cacheline: 56 bytes */ }; --- a/arch/x86/events/amd/ibs.c +++ 
b/arch/x86/events/amd/ibs.c @@ -89,6 +89,7 @@ struct perf_ibs { u64 max_period; unsigned long offset_mask[1]; int offset_max; + unsigned intfetch_count_reset_broken : 1; struct cpu_perf_ibs __percpu*pcpu; struct attribute**format_attrs; @@ -370,7 +371,13 @@ perf_ibs_event_update(struct perf_ibs *p static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, struct hw_perf_event *hwc, u64 config) { - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; + + if (perf_ibs->fetch_count_reset_broken) + wrmsrl(hwc->config_base, _config); + + _config |= perf_ibs->enable_mask; + wrmsrl(hwc->config_base, _config); } /* @@ -756,6 +763,13 @@ static __init void perf_event_ibs_init(v { struct attribute **attr = ibs_op_format_attrs; + /* +* Some chips fail to reset the fetch count when it is written; instead +* they need a 0-1 transition of IbsFetchEn. +*/ + if (boot_cpu_data.x86 >= 0x16 && boot_cpu_data.x86 <= 0x18) + perf_ibs_fetch.fetch_count_reset_broken = 1; + perf_ibs_pmu_init(&perf_ibs_fetch, "ibs_fetch"); if (ibs_caps & IBS_CAPS_OPCNT) {
Re: [PATCH v2 3/7] arch/x86/amd/ibs: Fix re-arming IBS Fetch
On Tue, Sep 08, 2020 at 04:47:36PM -0500, Kim Phillips wrote: > Stephane Eranian found a bug in that IBS' current Fetch counter was not > being reset when the driver would write the new value to clear it along > with the enable bit set, and found that adding an MSR write that would > first disable IBS Fetch would make IBS Fetch reset its current count. > > Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12, > 2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when > IbsFetchEn is changed from 0 to 1." > > Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch, > so the driver properly resets the internal counter to 0 and IBS > Fetch starts counting again. > > A family 15h machine tested does not have this problem, and the extra > wrmsr is also not needed on Family 19h, so only do the extra wrmsr on > families 16h through 18h. *groan*... > --- > arch/x86/events/amd/ibs.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c > index 26c36357c4c9..3eb9a55e998c 100644 > --- a/arch/x86/events/amd/ibs.c > +++ b/arch/x86/events/amd/ibs.c > @@ -363,7 +363,14 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, struct > perf_event *event, > static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, >struct hw_perf_event *hwc, u64 config) > { > - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); > + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; > + > + /* On Fam17h, the periodic fetch counter is set when IbsFetchEn is > changed from 0 to 1 */ > + if (perf_ibs == &perf_ibs_fetch && boot_cpu_data.x86 >= 0x16 && > boot_cpu_data.x86 <= 0x18) > + wrmsrl(hwc->config_base, _config); That's an anti-patttern (and for some reason this is the second time in a week I've seen it). This is a fairly hot path and you're adding a whole bunch of loads and branches. 
Granted, in case you need the extra wrmsr that's probably noise, but supposedly you're going to be fixing this in hardware eventually, and you'll be getting rid of the second wrmsr again. But then you're stuck with the loads and branches. A better option would be to use hwc->flags, you're loading from that line already, so it's guaranteed hot and then you only have a single branch. Or stick it in perf_ibs near enable_mask, same difference. > + _config |= perf_ibs->enable_mask; > + wrmsrl(hwc->config_base, _config); > } > > /* > -- > 2.27.0 >
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Wed, Sep 02, 2020 at 06:48:30AM -0700, Guenter Roeck wrote: > On 9/2/20 2:12 AM, pet...@infradead.org wrote: > > On Wed, Sep 02, 2020 at 11:09:35AM +0200, pet...@infradead.org wrote: > >> On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote: > >>> [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 > >>> check_flags.part.39+0x280/0x2a0 > >>> [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) > >> > >>> [0.00] [<004cff18>] lock_acquire+0x218/0x4e0 > >>> [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40 > >>> [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60 > >> > >> Lol! yes, I can see that going side-ways... let me poke at that. > > > > I suspect this will do. > > > > diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c > > index 889aa602f8d8..7cfe88e30b52 100644 > > --- a/arch/sparc/prom/p1275.c > > +++ b/arch/sparc/prom/p1275.c > > @@ -38,7 +38,7 @@ void p1275_cmd_direct(unsigned long *args) > > unsigned long flags; > > > > local_save_flags(flags); > > - local_irq_restore((unsigned long)PIL_NMI); > > + arch_local_irq_restore((unsigned long)PIL_NMI); > > raw_spin_lock(&prom_entry_lock); > > > > prom_world(1); > > > No, that doesn't help. Even removing that line entirely doesn't help. > The problem seems to be that interrupts are not enabled in the first > place. But why wasn't this a problem before ? Previously every interrupt opt would disable/enable things, now we only update state when something actually changes. Anyway, I'm struggling with qemu-system-sparc64, I've got a sparc64 cross booting to mount, but I'm not seeing this, could you get me your specific qemu cmdline please?
[PATCH] s390/idle: Fix suspicious RCU usage
After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the lock tracepoints are visible to lockdep and RCU-lockdep is finding a bunch more RCU violations that were previously hidden. Switch the idle->seqcount over to using raw_write_*() to avoid the lockdep annotation and thus the lock tracepoints. Reported-by: Guenter Roeck Signed-off-by: Peter Zijlstra (Intel) --- arch/s390/kernel/idle.c |5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) --- a/arch/s390/kernel/idle.c +++ b/arch/s390/kernel/idle.c @@ -39,14 +39,13 @@ void enabled_wait(void) local_irq_restore(flags); /* Account time spent with enabled wait psw loaded as idle time. */ - /* XXX seqcount has tracepoints that require RCU */ - write_seqcount_begin(&idle->seqcount); + raw_write_seqcount_begin(&idle->seqcount); idle_time = idle->clock_idle_exit - idle->clock_idle_enter; idle->clock_idle_enter = idle->clock_idle_exit = 0ULL; idle->idle_time += idle_time; idle->idle_count++; account_idle_time(cputime_to_nsecs(idle_time)); - write_seqcount_end(&idle->seqcount); + raw_write_seqcount_end(&idle->seqcount); } NOKPROBE_SYMBOL(enabled_wait);
Re: [PATCH 00/12] x86/platform/uv: Updates for UV5
On Tue, Sep 08, 2020 at 08:28:16AM -0700, Mike Travis wrote: > I didn't. If I could figure out how to convert quilt patches into git > commits I might be able to do that? (And I didn't know that diffstats were > needed on the into?) $ git quiltimport Or, for the more enterprising person: $ quilt series | while read file; do git am $file; done Generating a diffstat from a quilt series (when applied): $ quilt diff --combine - | diffstat
[PATCH] sparc64: Fix irqtrace warnings on Ultra-S
On Tue, Sep 08, 2020 at 07:40:23AM -0700, Guenter Roeck wrote: > qemu-system-sparc64 -M sun4u -cpu "TI UltraSparc IIi" -m 512 \ > -initrd rootfs.cpio \ > -kernel arch/sparc/boot/image -no-reboot \ > -append "panic=-1 slub_debug=FZPUA rdinit=/sbin/init console=ttyS0" \ > -nographic -monitor none Thanks I got it. Also enabling DEBUG_LOCKDEP helps (-: --- Subject: sparc64: Fix irqtrace warnings on Ultra-S Recent changes in Lockdep's IRQTRACE broke Ultra-S. In order avoid redundant IRQ state changes, local_irq_restore() lost the ability to trace a disable. Change the code to use local_irq_save() to disable IRQs and then use arch_local_irq_restore() to further disable NMIs. This result in slightly suboptimal code, but given this code uses a global spinlock, performance cannot be its primary purpose. Fixes: 044d0d6de9f5 ("lockdep: Only trace IRQ edges") Reported-by: Guenter Roeck Signed-off-by: Peter Zijlstra (Intel) --- A possible alternative would be: local_save_flags(flags); arch_local_irq_restore((unsigned long)PIL_NMI); if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) trace_hardirqs_off(); which generates optimal code, but is more verbose. arch/sparc/prom/p1275.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c index 889aa602f8d8..e22233fcf741 100644 --- a/arch/sparc/prom/p1275.c +++ b/arch/sparc/prom/p1275.c @@ -37,16 +37,15 @@ void p1275_cmd_direct(unsigned long *args) { unsigned long flags; - local_save_flags(flags); - local_irq_restore((unsigned long)PIL_NMI); + local_irq_save(flags); + arch_local_irq_restore((unsigned long)PIL_NMI); raw_spin_lock(&prom_entry_lock); prom_world(1); prom_cif_direct(args); prom_world(0); - raw_spin_unlock(&prom_entry_lock); - local_irq_restore(flags); + raw_spin_unlock_irqrestore(&prom_entry_lock, flags); } void prom_cif_init(void *cif_handler, void *cif_stack)
Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On Mon, Sep 07, 2020 at 06:01:15PM +0200, pet...@infradead.org wrote: > On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com wrote: > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > > index 0f3d01562ded..fa08d810dcd2 100644 > > --- a/arch/x86/events/core.c > > +++ b/arch/x86/events/core.c > > @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event *event, > > int flags) > > > > cpuc->events[idx] = event; > > __set_bit(idx, cpuc->active_mask); > > - __set_bit(idx, cpuc->running); > > + /* The cpuc->running is only used by the P4 PMU */ > > + if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON) && > > + (boot_cpu_data.x86 == 0xf)) > > + __set_bit(idx, cpuc->running); > > x86_pmu.enable(event); > > perf_event_update_userpage(event); > > } > > Yuck! Use a static_branch() or something. This is a gnarly nest of code > that runs 99.9% of the time to conclude a negative. IOW a complete waste > of cycles for everybody not running a P4 space heater. Better still, move it into p4_pmu_enable_event().
Re: [PATCH -next] fork: silence a false positive warning in __mmdrop
On Tue, Sep 08, 2020 at 12:50:44PM -0400, Qian Cai wrote: > > No, you're talking nonsense. We must not free @mm when > > 'current->active_mm == mm', never. > > Yes, you are right. It still trigger this below on powerpc with today's > linux-next by fuzzing for a while (saw a few times on recent linux-next before > as well but so far mostly reproducible on powerpc here). Any idea? If you can reliably reproduce this, the next thing is to trace mm_count and figure out where it goes side-ways. I suppose we're looking for an 'extra' decrement. Mark tried this for a while but gave up because he couldn't reliably reproduce.
Re: [PATCH v2 4/5] seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
On Fri, Sep 04, 2020 at 05:32:30PM +0200, Ahmed S. Darwish wrote: > @@ -406,13 +443,20 @@ static inline int read_seqcount_t_retry(const > seqcount_t *s, unsigned start) > return __read_seqcount_t_retry(s, start); > } > > +/* > + * Enforce non-preemptibility for all seqcount_LOCKNAME_t writers. Don't > + * do it for PREEMPT_RT, for the reasons outlined at __SEQ_LOCK(). > + */ > +#define __seq_enforce_writer_non_preemptibility(s) \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) > + > /** > * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep > * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants > */ > #define raw_write_seqcount_begin(s) \ > do { \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ > @@ -433,7 +477,7 @@ static inline void raw_write_seqcount_t_begin(seqcount_t > *s) > do { \ > raw_write_seqcount_t_end(__seqcount_ptr(s));\ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_enable(); \ > } while (0) > > @@ -456,7 +500,7 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s) > do { \ > __seqcount_assert_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ > @@ -483,7 +527,7 @@ static inline void > write_seqcount_t_begin_nested(seqcount_t *s, int subclass) > do { \ > __seqcount_assert_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > write_seqcount_t_begin(__seqcount_ptr(s)); \ > @@ -504,7 +548,7 @@ static inline void write_seqcount_t_begin(seqcount_t *s) > do { \ > write_seqcount_t_end(__seqcount_ptr(s));\ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if 
(__seq_enforce_writer_non_preemptibility(s)) \ > preempt_enable(); \ > } while (0) I've replaced the above with the below, afaict there were no users of __seqcount_lock_preemptible() left. --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -228,7 +228,11 @@ __seqprop_##lockname##_sequence(const se static __always_inline bool\ __seqprop_##lockname##_preemptible(const seqcount_##lockname##_t *s) \ { \ - return preemptible; \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) \ + return preemptible; \ + \ + /* PREEMPT_RT relies on the above LOCK+UNLOCK */\ + return false; \ } \ \ static __always_inline void\
Re: [PATCH v2 2/5] seqlock: Use unique prefix for seqcount_t property accessors
On Fri, Sep 04, 2020 at 05:32:28PM +0200, Ahmed S. Darwish wrote: > static __always_inline seqcount_t * \ > -__seqcount_##lockname##_ptr(seqcount_##lockname##_t *s) > \ > +__seqprop_seqcount_##lockname##_ptr(seqcount_##lockname##_t *s) > \ I did s/__seqprop_seqcount_/__seqprop_/g on all this. seqprop is a contraction of seqcount property, no need to put in yet another seqcount I feel.
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Tue, Sep 08, 2020 at 11:15:14AM +, eddy...@trendmicro.com wrote: > > From: pet...@infradead.org > > > > I'm now trying and failing to reproduce I can't seem to make it use > > int3 today. It seems to want to use ftrace or refuses everything. I'm > > probably doing it wrong. > > > > You can turn off CONFIG_KPROBES_ON_FTRACE (and also sysctl > debug.kprobes-optimization) to make it use int3 handler I'm fairly sure I used the sysctl like in your original reproducer. I'll try again later.
Re: linux-next: build failure after merge of the tip tree
On Tue, Sep 08, 2020 at 07:12:23PM +1000, Stephen Rothwell wrote: > Hi all, > > After merging the tip tree, today's linux-next build (powerpc > allyesconfig) failed like this: > > ERROR: modpost: too long symbol > ".__tracepoint_iter_pnfs_mds_fallback_pg_get_mirror_count" > [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] > > Caused by commit > > d25e37d89dd2 ("tracepoint: Optimize using static_call()") > > Exported symbols need to be <= (64 - sizeof(Elf_Addr)) long. This is > presumably 56 on 64 bit arches and the above symbol (including the '.') > is 56 characters long. I suppose something like the below ought to cure that. Still, stupid long tracename that. --- diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 3722a10fc46d..81fa0b2f271e 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -154,7 +154,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) #ifdef CONFIG_HAVE_STATIC_CALL #define __DO_TRACE_CALL(name) static_call(tp_func_##name) #else -#define __DO_TRACE_CALL(name) __tracepoint_iter_##name +#define __DO_TRACE_CALL(name) __traceiter_##name #endif /* CONFIG_HAVE_STATIC_CALL */ /* @@ -232,8 +232,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) * poking RCU a bit. 
*/ #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \ - extern int __tracepoint_iter_##name(data_proto);\ - DECLARE_STATIC_CALL(tp_func_##name, __tracepoint_iter_##name); \ + extern int __traceiter_##name(data_proto); \ + DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name);\ extern struct tracepoint __tracepoint_##name; \ static inline void trace_##name(proto) \ { \ @@ -288,19 +288,19 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) static const char __tpstrtab_##_name[] \ __section(__tracepoints_strings) = #_name; \ extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \ - int __tracepoint_iter_##_name(void *__data, proto); \ + int __traceiter_##_name(void *__data, proto); \ struct tracepoint __tracepoint_##_name __used \ __section(__tracepoints) = {\ .name = __tpstrtab_##_name, \ .key = STATIC_KEY_INIT_FALSE, \ .static_call_key = &STATIC_CALL_KEY(tp_func_##_name), \ .static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \ - .iterator = &__tracepoint_iter_##_name, \ + .iterator = &__traceiter_##_name, \ .regfunc = _reg,\ .unregfunc = _unreg,\ .funcs = NULL };\ __TRACEPOINT_ENTRY(_name); \ - int __tracepoint_iter_##_name(void *__data, proto) \ + int __traceiter_##_name(void *__data, proto)\ { \ struct tracepoint_func *it_func_ptr;\ void *it_func; \ @@ -314,18 +314,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) } while ((++it_func_ptr)->func);\ return 0; \ } \ - DEFINE_STATIC_CALL(tp_func_##_name, __tracepoint_iter_##_name); + DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name); #define DEFINE_TRACE(name, proto, args)\ DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args)); #define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \ EXPORT_SYMBOL_GPL(__tracepoint_##name); \ - EXPORT_SYMBOL_GPL(__tracepoint_iter_##name);\ + EXPORT_SYMBOL_GPL(__traceiter_##name); \ EXPORT_STATIC_CALL_GPL(tp_func_##name) #define EXPORT_TRACEPOINT_SYMBOL(name) \ 
EXPORT_SYMBOL(__tracepoint_##name); \ - EXPORT_SYMBOL(__tracepoint_iter_##name);\ + EXPORT_SYMBOL(__traceiter_##name); \ EXPORT_STATIC_CALL(tp_func_##name)
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Thu, Sep 03, 2020 at 10:39:54AM +0900, Masami Hiramatsu wrote: > > There's a bug, that might make it miss it. I have a patch. I'll send it > > shortly. > > OK, I've confirmed that the lockdep warns on kretprobe from INT3 > with your fix. I'm now trying and failing to reproduce I can't seem to make it use int3 today. It seems to want to use ftrace or refuses everything. I'm probably doing it wrong. Could you verify the below actually works? It's on top of the first 16 patches. > Of course make it lockless then warning is gone. > But even without the lockless patch, this warning can be false-positive > because we prohibit nested kprobe call, right? Yes, because the actual nesting is avoided by kprobe_busy, but lockdep can't tell. Lockdep sees a regular lock user and an in-nmi lock user and figures that's a bad combination. --- --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1241,48 +1241,47 @@ void recycle_rp_inst(struct kretprobe_in } NOKPROBE_SYMBOL(recycle_rp_inst); -void kretprobe_hash_lock(struct task_struct *tsk, -struct hlist_head **head, unsigned long *flags) -__acquires(hlist_lock) -{ - unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - raw_spinlock_t *hlist_lock; - - *head = &kretprobe_inst_table[hash]; - hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_lock_irqsave(hlist_lock, *flags); -} -NOKPROBE_SYMBOL(kretprobe_hash_lock); - static void kretprobe_table_lock(unsigned long hash, unsigned long *flags) __acquires(hlist_lock) { raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_lock_irqsave(hlist_lock, *flags); + /* +* HACK, due to kprobe_busy we'll never actually recurse, make NMI +* context use a different lock class to avoid it reporting. 
+*/ + raw_spin_lock_irqsave_nested(hlist_lock, *flags, !!in_nmi()); } NOKPROBE_SYMBOL(kretprobe_table_lock); -void kretprobe_hash_unlock(struct task_struct *tsk, - unsigned long *flags) +static void kretprobe_table_unlock(unsigned long hash, + unsigned long *flags) __releases(hlist_lock) { + raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); + raw_spin_unlock_irqrestore(hlist_lock, *flags); +} +NOKPROBE_SYMBOL(kretprobe_table_unlock); + +void kretprobe_hash_lock(struct task_struct *tsk, +struct hlist_head **head, unsigned long *flags) +__acquires(hlist_lock) +{ unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - raw_spinlock_t *hlist_lock; - hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_unlock_irqrestore(hlist_lock, *flags); + *head = &kretprobe_inst_table[hash]; + kretprobe_table_lock(hash, flags); } -NOKPROBE_SYMBOL(kretprobe_hash_unlock); +NOKPROBE_SYMBOL(kretprobe_hash_lock); -static void kretprobe_table_unlock(unsigned long hash, - unsigned long *flags) +void kretprobe_hash_unlock(struct task_struct *tsk, + unsigned long *flags) __releases(hlist_lock) { - raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_unlock_irqrestore(hlist_lock, *flags); + unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); + kretprobe_table_unlock(hash, flags); } -NOKPROBE_SYMBOL(kretprobe_table_unlock); +NOKPROBE_SYMBOL(kretprobe_hash_unlock); struct kprobe kprobe_busy = { .addr = (void *) get_kprobe,
Re: [RFC PATCH] sched: only issue an audit on privileged operation
On Fri, Sep 04, 2020 at 06:00:31PM +0200, Christian Göttsche wrote:
> sched_setattr(2) does via kernel/sched/core.c:__sched_setscheduler()
> issue a CAP_SYS_NICE audit event unconditionally, even when the requested
> operation does not require that capability / is un-privileged.
>
> Perform privileged/unprivileged categorization first and perform a
> capable test only if needed.
>
> Signed-off-by: Christian Göttsche
> ---
>  kernel/sched/core.c | 65 -
>  1 file changed, 47 insertions(+), 18 deletions(-)

So who sodding cares about audit, and why is that a reason to make a
trainwreck of code?
Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity
On Mon, Sep 07, 2020 at 11:48:45AM +0100, Qais Yousef wrote:
> IMHO the above is a hack. Out-of-tree modules should rely on public headers
> and exported functions only. What you propose means that people who want to
> use these tracepoints in meaningful way must have a prebuilt kernel handy.
> Which is maybe true for us who work in the embedded world. But users who run
> normal distro kernels (desktop/servers) will fail to build against

But this isn't really aimed at regular users. We're aiming this at
developers (IIUC) so I don't really see this as a problem.

> FWIW, I did raise this concern with Peter in 2019 OSPM and he was okay with
> the exports as it's still not a contract and they can disappear anytime we
> want. Migrating to using BTF is the right way forward IMO. I don't think
> what we have here is out-of-control yet. Though I agree they're annoying.

Right, we're hiding behind the explicit lack of ABI for modules.

Anyway, CTF/BTF/random other crap that isn't DWARFs should work fine to
replace all this muck. Just no idea what the state of any of that is.
Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Mon, Sep 07, 2020 at 06:29:13PM +0200, Ahmed S. Darwish wrote:
> I've been unsuccessful in reproducing this huge, 200+ bytes, difference.
> Can I please get the defconfig and GCC version?

I think I lost the config and it's either gcc-9.3 or gcc-10, I can't
remember.

I just tried with:

  make defconfig
  ./scripts/config --enable PREEMPT --enable DEBUG_ATOMIC_SLEEP
  make oldconfig

And that reproduces things a little, but nowhere near as horrible as I
reported. Clearly I had something mad enabled by accident.

> Here are the two competing implementations:
>
> noinline void cyc2ns_read_begin_v1(struct cyc2ns_data *data)
> {
>	seqcount_latch_t *seqcount;
>	int seq, idx;
>
>	preempt_disable_notrace();
>
>	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
>	do {
>		seq = raw_read_seqcount_latch(seqcount);
>		idx = seq & 1;
>
>		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
>	} while (read_seqcount_latch_retry(seqcount, seq));
> }
>
> noinline void cyc2ns_read_begin_v2(struct cyc2ns_data *data)
> {
>	int seq, idx;
>
>	preempt_disable_notrace();
>
>	do {
>		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
>		idx = seq & 1;
>
>		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
>	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }

Don't look at this function in isolation, look at native_sched_clock()
where it's used as a whole.

What happened (afaict) is that the change caused it to use more
registers and ended up spilling crap on the stack.
GCC-9.3 gives me:

(this_cpu variant)

0c00 :
      c00:	e9 65 00 00 00       	jmpq   c6a
0005  c05:	0f 31                	rdtsc
0007  c07:	48 c1 e2 20          	shl    $0x20,%rdx
000b  c0b:	48 09 c2             	or     %rax,%rdx
000e  c0e:	65 ff 05 00 00 00 00 	incl   %gs:0x0(%rip)        # c15
0011  c11: R_X86_64_PC32	__preempt_count-0x4
0015  c15:	65 44 8b 05 00 00 00 	mov    %gs:0x0(%rip),%r8d   # c1d
001c  c1c:	00
0019  c19: R_X86_64_PC32	.data..percpu..shared_aligned+0x1c
001d  c1d:	44 89 c0             	mov    %r8d,%eax
0020  c20:	83 e0 01             	and    $0x1,%eax
0023  c23:	48 c1 e0 04          	shl    $0x4,%rax
0027  c27:	48 8d 88 00 00 00 00 	lea    0x0(%rax),%rcx
002a  c2a: R_X86_64_32S	.data..percpu..shared_aligned
002e  c2e:	65 48 8b 79 08       	mov    %gs:0x8(%rcx),%rdi
0033  c33:	65 8b b0 00 00 00 00 	mov    %gs:0x0(%rax),%esi
0036  c36: R_X86_64_32S	.data..percpu..shared_aligned
003a  c3a:	65 8b 49 04          	mov    %gs:0x4(%rcx),%ecx
003e  c3e:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax   # c45
0041  c41: R_X86_64_PC32	.data..percpu..shared_aligned+0x1c
0045  c45:	41 39 c0             	cmp    %eax,%r8d
0048  c48:	75 cb                	jne    c15
004a  c4a:	89 f0                	mov    %esi,%eax
004c  c4c:	48 f7 e2             	mul    %rdx
004f  c4f:	48 0f ad d0          	shrd   %cl,%rdx,%rax
0053  c53:	48 d3 ea             	shr    %cl,%rdx
0056  c56:	f6 c1 40             	test   $0x40,%cl
0059  c59:	48 0f 45 c2          	cmovne %rdx,%rax
005d  c5d:	48 01 f8             	add    %rdi,%rax
0060  c60:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # c67
0063  c63: R_X86_64_PC32	__preempt_count-0x4
0067  c67:	74 1a                	je     c83
0069  c69:	c3                   	retq
006a  c6a:	48 69 05 00 00 00 00 	imul   $0xf4240,0x0(%rip),%rax # c75
0071  c71:	40 42 0f 00
006d  c6d: R_X86_64_PC32	jiffies_64-0x8
0075  c75:	48 ba 00 b8 64 d9 05 	movabs $0xfff0be05d964b800,%rdx
007c  c7c:	be f0 ff
007f  c7f:	48 01 d0             	add    %rdx,%rax
0082  c82:	c3                   	retq
0083  c83:	55                   	push   %rbp
0084  c84:	48 89 e5             	mov    %rsp,%rbp
0087  c87:	e8 00 00 00 00       	callq  c8c
0088  c88: R_X86_64_PLT32	preempt_schedule_notrace_thunk-0x4
008c  c8c:	5d                   	pop    %rbp
008d  c8d:	c3                   	retq

(seqcount_latch variant)

0c20 :
      c20:	e9 89 00 00 00       	jmpq   cae
0005  c25:	55
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Mon, Sep 07, 2020 at 02:03:09PM +0200, Joerg Vehlow wrote:
>
> On 9/7/2020 1:46 PM, pet...@infradead.org wrote:
> > I think it's too complicated for what is needed, did you see my
> > suggestion from a year ago? Did I miss something obvious?
>
> This one?
> https://lore.kernel.org/linux-fsdevel/20191219090535.gv2...@hirez.programming.kicks-ass.net/
>
> I think it may be a bit incorrect?
> According to the original comment in __crash_kexec, the mutex was used to
> prevent a sys_kexec_load, while crash_kexec is executed. Your proposed patch
> does not lock the mutex in crash_kexec.

Sure, but any mutex taker will (spin) wait for panic_cpu == CPU_INVALID.

And if the mutex is already held, we'll not run __crash_kexec(), just
like the trylock() would do today.

> This does not cover the original use
> case anymore. The only thing that is protected now are two panicking cores
> at the same time.

I'm not following. AFAICT it does exactly what the old code did.

Although maybe I didn't replace all kexec_mutex users, I now see that
thing isn't static.

> Actually, this implementation feels even more hacky to me

It's more minimal ;-)

It's simpler in that it only provides the required semantics (as I
understand them) and does not attempt to implement a more general
trylock() like primitive that isn't needed.

Also, read the kexec_lock() implementation you posted and explain to me
what happens when kexec_busy is elevated.

Also note the lack of confusing loops in my code.
Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 0f3d01562ded..fa08d810dcd2 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event *event, int flags)
>
>  	cpuc->events[idx] = event;
>  	__set_bit(idx, cpuc->active_mask);
> -	__set_bit(idx, cpuc->running);
> +	/* The cpuc->running is only used by the P4 PMU */
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON) &&
> +	    (boot_cpu_data.x86 == 0xf))
> +		__set_bit(idx, cpuc->running);
>  	x86_pmu.enable(event);
>  	perf_event_update_userpage(event);
>  }

Yuck! Use a static_branch() or something. This is a gnarly nest of code
that runs 99.9% of the time to conclude a negative. IOW a complete waste
of cycles for everybody not running a P4 space heater.

> @@ -1544,6 +1547,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>  	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
>  		goto do_del;
>
> +	if (READ_ONCE(x86_pmu.attr_rdpmc) && x86_pmu.sched_task &&
> +	    test_bit(event->hw.idx, cpuc->active_mask))
> +		__set_bit(event->hw.idx, cpuc->dirty);

And that too seems like an overly complicated set of tests and
branches. This should be effectively true for the 99% common case.

> @@ -2219,11 +2225,45 @@ static int x86_pmu_event_init(struct perf_event *event)
>  	return err;
>  }
>
> +void x86_pmu_clear_dirty_counters(void)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	int i;
> +
> +	if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
> +		return;
> +
> +	/* Don't need to clear the assigned counter. */
> +	for (i = 0; i < cpuc->n_events; i++)
> +		__clear_bit(cpuc->assign[i], cpuc->dirty);
> +
> +	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
> +		/* Metrics events don't have corresponding HW counters. */
> +		if (is_metric_idx(i))
> +			continue;
> +		else if (i >= INTEL_PMC_IDX_FIXED)
> +			wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
> +		else
> +			wrmsrl(x86_pmu_event_addr(i), 0);
> +	}
> +
> +	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
> +}

The bitmap_{empty,zero}() do indeed compile into single instructions,
neat!

>  static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
>  {
>  	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
>  		return;
>
> +	/*
> +	 * Enable sched_task() for the RDPMC task,
> +	 * and clear the existing dirty counters.
> +	 */
> +	if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event)) {
> +		perf_sched_cb_inc(event->ctx->pmu);
> +		x86_pmu_clear_dirty_counters();
> +	}

I'm failing to see the correctness of the !is_sampling_event() part
there.

>  	/*
>  	 * This function relies on not being called concurrently in two
>  	 * tasks in the same mm. Otherwise one task could observe
> @@ -2246,6 +2286,9 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
>  	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
>  		return;
>
> +	if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event))
> +		perf_sched_cb_dec(event->ctx->pmu);
> +

Idem.

>  	if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
>  		on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
>  }
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c72e4904e056..e67713bfa33a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4166,11 +4166,39 @@ static void intel_pmu_cpu_dead(int cpu)
>  	intel_cpuc_finish(&per_cpu(cpu_hw_events, cpu));
>  }
>
> +static void intel_pmu_rdpmc_sched_task(struct perf_event_context *ctx)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct perf_event *event;
> +
> +	if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
> +		return;
> +
> +	/*
> +	 * If the new task has the RDPMC enabled, clear the dirty counters to
> +	 * prevent the potential leak. If the new task doesn't have the RDPMC
> +	 * enabled, do nothing.
> +	 */
> +	list_for_each_entry(event, &ctx->event_list, event_entry) {
> +		if (event->hw.target &&
> +		    (event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED) &&
> +		    !is_sampling_event(event) &&
> +		    atomic_read(&event->mmap_count))
> +			break;
> +	}
> +	if (&event->event_entry == &ctx->event_list)
> +		return;

That's horrific, what's wrong with something like:

	if (!atomic_re
Re: Requirements to control kernel isolation/nohz_full at runtime
(your mailer broke and forgot to keep lines shorter than 78 chars)

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger syscall slowpath on
> CPUs that don't want nohz_full. Also we don't want to iterate all
> threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does
> that well.

Build on the common entry code I suppose. Then any arch that uses that
gets to have the new features.

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() /
> exception_exit(). This saves the previous state of context tracking
> and switch to kernel mode (from context tracking POV) temporarily.
> Since this state is saved on the stack, this prevents us from turning
> off context tracking entirely on a CPU: The tracking must be done on
> all CPUs and that takes some cycles.
>
> This means that, considering early entry code (before the call to
> context tracking upon kernel entry, and after the call to context
> tracking upon kernel exit), we must take care of few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or if it does,
> the given exception can't schedule or use RCU (unless it calls
> rcu_nmi_enter()). Otherwise the exception must call
> exception_enter()/exception_exit() which we don't want.

I think this is true for x86. Early entry has interrupts disabled, any
exception that can still happen is NMI-like and will thus use
rcu_nmi_enter(). On x86 that now includes #DB (which is also excluded
due to us refusing to set execution breakpoints on entry code), #BP,
NMI and MCE.

> 2) No call to schedule_user().

I'm not sure what that is supposed to do, but x86 doesn't appear to
have it, so all good :-)

> 3) Make sure early entry code is not interruptible or
> preempt_schedule_irq() would rely on
> exception_entry()/exception_exit()

This is so for x86.

> 4) Make sure early entry code can't be traced (no call to
> preempt_schedule_notrace()), or if it does it can't schedule

noinstr is your friend.

> I believe x86 does most of that well.

It does now.
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Mon, Sep 07, 2020 at 12:51:37PM +0200, Joerg Vehlow wrote:
> Hi,
>
> I guess there is currently no other way than to use something like Steven
> proposed. I implemented and tested the attached patch with a module,
> that triggers the soft lockup detection and it works as expected.
> I did not use inline functions, but normal functions implemented in
> kexec_core, because there is nothing time critical here.
> I also added the mutex_lock to the trylock variant, because then the unlock
> function can be the same for both lock functions.
>
> What do you think?

I think it's too complicated for what is needed; did you see my
suggestion from a year ago? Did I miss something obvious?
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Sat, Aug 22, 2020 at 07:49:28PM -0400, Steven Rostedt wrote:
> From this email:
>
> > The problem happens when that owner is the idle task, this can happen
> > when the irq/softirq hits the idle task, in that case the contending
> > mutex_lock() will try and PI boost the idle task, and that is a big
> > no-no.
>
> What's wrong with priority boosting the idle task? It's not obvious,
> and I can't find comments in the code saying it would be bad.
> The idle task is not mentioned at all in rtmutex.c and not mentioned in
> kernel/locking except for some comments about RCU in lockdep.

There used to be a giant BUG() and comment somewhere in the PI code I
think... but that's vague memories.

> I see that in the idle code the prio_change method does a BUG(), but
> there's no comment to say why it does so.
>
> The commit that added that BUG, doesn't explain why it can't happen:
>
> a8941d7ec8167 ("sched: Simplify the idle scheduling class")

That's like almost a decade ago ...

> I may have once known the rationale behind all this, but it's been a
> long time since I worked on the PI code, and it's out of my cache.

I suffer much the same problem.

So conceptually there's the problem that idle must always be runnable,
and the moment you boost it, it becomes subject to a different
scheduling class. Imagine for example what happens when we boost it to
RT and then have it be subject to throttling. What are we going to run
when the idle task is no longer eligible to run?

(it might all work out by accident, but ISTR we had a whole bunch of
fun in the earlier days of RT due to things like that)
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 09:26:04AM +0200, pet...@infradead.org wrote:
> On Thu, Sep 03, 2020 at 11:07:28AM +0900, Masahiro Yamada wrote:
> > Will re-implementing your sorting logic
> > in bash look cleaner?
>
> Possibly, I can try, we'll see.

It is somewhat cleaner, but it is _abysmally_ slow. Bash sucks :-(

It is still broken in all the same ways as before, I figured I'd get it
'working' first.

---
diff --git a/scripts/tags.sh b/scripts/tags.sh
index 32d3f53af10b..ec2688b3441a 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -239,10 +239,65 @@ setup_regex()
 	done
 }
 
+sort_tags()
+{
+	export LC_ALL=C
+
+	# start concurrent sort
+	coproc sort
+	# HACK, clone sort output into 3 to ensure we can still read it
+	# after sort terminates
+	exec 3<&${COPROC[0]}
+
+	while read tag file rest;
+	do
+		local tmp=${rest#*;\"}
+
+		case "${tmp:1:1}" in # Precedence for 'C' kinds
+
+		c) order="A";; # classes
+		s) order="B";; # structure names
+		t) order="C";; # typedefs
+		g) order="D";; # enumeration names
+		u) order="E";; # union names
+		n) order="F";; # namespaces
+
+		f) order="G";; # function definitions
+		p) order="H";; # function prototypes
+		d) order="I";; # macro definitions
+
+		e) order="J";; # enumerators (values inside an enumeration)
+		m) order="K";; # class, struct and union members
+		v) order="L";; # variable definitions
+
+		l) order="M";; # local variables [off]
+		x) order="N";; # external and forward variable declarations
+
+		*) order="Z";;
+
+		esac
+
+		# write to sort with a new sort-key prepended
+		echo "${tag}${order}${tag} ${file} ${rest}" >&${COPROC[1]}
+	done
+
+	# close sort input
+	exec {COPROC[1]}>&-
+
+	# consume sort output
+	while read -u 3 key line;
+	do
+		# strip the sort-key
+		echo "${line}"
+	done
+}
+
 exuberant()
 {
+	(
 	setup_regex exuberant asm c
-	all_target_sources | xargs $1 -a			\
+	all_target_sources | xargs $1				\
 		-I __initdata,__exitdata,__initconst,__ro_after_init	\
 		-I __initdata_memblock					\
 		-I __refdata,__attribute,__maybe_unused,__always_unused \
@@ -256,12 +311,16 @@ exuberant()
 		-I DEFINE_TRACE,EXPORT_TRACEPOINT_SYMBOL,EXPORT_TRACEPOINT_SYMBOL_GPL \
 		-I static,const						\
 		--extra=+fq --c-kinds=+px --fields=+iaS --langmap=c:+.h \
+		--sort=no -o -						\
 		"${regex[@]}"
 
 	setup_regex exuberant kconfig
-	all_kconfigs | xargs $1 -a				\
+	all_kconfigs | xargs $1					\
+		--sort=no -o -					\
 		--langdef=kconfig --language-force=kconfig "${regex[@]}"
+
+	) | sort_tags > tags
+
 }
 
 emacs()
Re: [PATCH] MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
On Fri, Sep 04, 2020 at 11:26:23AM +0200, Daniel Bristot de Oliveira wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 592467ba3f4d..56d185210a43 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -15363,6 +15363,7 @@ R:	Dietmar Eggemann (SCHED_NORMAL)
>  R:	Steven Rostedt (SCHED_FIFO/SCHED_RR)
>  R:	Ben Segall (CONFIG_CFS_BANDWIDTH)
>  R:	Mel Gorman (CONFIG_NUMA_BALANCING)
> +R:	Daniel Bristot de Oliveira (SCHED_DEADLINE)
>  L:	linux-kernel@vger.kernel.org
>  S:	Maintained
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

Absolutely, great to have you looking over that!
Re: [PATCH v2 05/28] objtool: Add a pass for generating __mcount_loc
On Thu, Sep 03, 2020 at 03:03:30PM -0700, Sami Tolvanen wrote:
> On Thu, Sep 3, 2020 at 2:51 PM Kees Cook wrote:
> >
> > On Thu, Sep 03, 2020 at 01:30:30PM -0700, Sami Tolvanen wrote:
> > > From: Peter Zijlstra
> > >
> > > Add the --mcount option for generating __mcount_loc sections
> > > needed for dynamic ftrace. Using this pass requires the kernel to
> > > be compiled with -mfentry and CC_USING_NOP_MCOUNT to be defined
> > > in Makefile.
> > >
> > > Link: https://lore.kernel.org/lkml/20200625200235.gq4...@hirez.programming.kicks-ass.net/
> > > Signed-off-by: Peter Zijlstra
> >
> > Hmm, I'm not sure why this hasn't gotten picked up yet. Is this expected
> > to go through -tip or something else?
>
> Note that I picked up this patch from Peter's original email, to which
> I included a link in the commit message, but it wasn't officially
> submitted as a patch. However, the previous discussion seems to have
> died, so I included the patch in this series, as it cleanly solves the
> problem of whitelisting non-call references to __fentry__. I was
> hoping for Peter and Steven to comment on how they prefer to proceed
> here.

Right; so I'm obviously fine with this patch and I suppose I can pick
it (and the next) into tip/objtool/core, provided Steve is okay with
this approach.
Re: [PATCH v2 00/28] Add support for Clang LTO
Please don't nest series! Start a new thread for every posting.