Re: [PATCH] page_alloc: Fix freeing non-compound pages
On Tue, Sep 22, 2020 at 03:00:17PM +0100, Matthew Wilcox (Oracle) wrote:
> Here is a very rare race which leaks memory:
>
> Page P0 is allocated to the page cache.
> Page P1 is free.
>
> Thread A                Thread B                Thread C
> find_get_entry():
> xas_load() returns P0
>                         Removes P0 from page cache
>                         Frees P0
>                         P0 merged with its buddy P1
>                                                 alloc_pages(GFP_KERNEL, 1) returns P0
>                                                 P0 has refcount 1
> page_cache_get_speculative(P0)
> P0 has refcount 2
>                                                 __free_pages(P0)
>                                                 P0 has refcount 1
> put_page(P0)
> P1 is not freed
>
> Fix this by freeing all the pages in __free_pages() that won't be freed
> by the call to put_page(). It's usually not a good idea to split a page,
> but this is a very unlikely scenario.
>
> Fixes: e286781d5f2e ("mm: speculative page references")
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  mm/page_alloc.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fab5e97dc9ca..5db74797db39 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4943,10 +4943,19 @@ static inline void free_the_page(struct page *page, unsigned int order)
>  		__free_pages_ok(page, order);
>  }
>
> +/*
> + * If we free a non-compound allocation, another thread may have a
> + * speculative reference to the first page. It has no way of knowing
> + * about the rest of the allocation, so we have to free all but the
> + * first page here.
> + */
>  void __free_pages(struct page *page, unsigned int order)
>  {
>  	if (put_page_testzero(page))
>  		free_the_page(page, order);
> +	else if (!PageHead(page))
> +		while (order-- > 0)
> +			free_the_page(page + (1 << order), order);
>  }
>  EXPORT_SYMBOL(__free_pages);

So the obvious question I have here is why not teach put_page() to free
the whole thing?
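The splitting loop in the patch is compact; a quick model (Python, purely illustrative) shows that freeing `page + (1 << order)` at each decremented order covers every page of the allocation except page 0, which the remaining reference holder's eventual put_page() frees:

```python
def pages_freed_by_split(order):
    """Model of the patch's fallback loop:
         while (order-- > 0)
             free_the_page(page + (1 << order), order);
       Returns the set of page offsets freed by the split."""
    freed = set()
    while order > 0:
        order -= 1
        base = 1 << order                       # offset of the block freed this pass
        freed |= set(range(base, base + (1 << order)))
    return freed

# For an order-3 (8-page) allocation, pages 1..7 are freed by the split;
# page 0 stays behind for the speculative reference's put_page().
print(sorted(pages_freed_by_split(3)))   # → [1, 2, 3, 4, 5, 6, 7]
```

So the split frees exactly the sub-buddies the speculative reference holder cannot know about, never touching page 0 and never freeing any page twice.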
Re: [PATCH] latency improvement in __smp_call_single_queue
On Wed, Sep 23, 2020 at 10:00:41AM -0500, George Prekas wrote:
> If an interrupt arrives between llist_add and
> send_call_function_single_ipi in the following code snippet, then the
> remote CPU will not receive the IPI in a timely manner and subsequent
> SMP calls even from other CPUs for other functions will be delayed:
>
>	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
>		send_call_function_single_ipi(cpu);
>
> Note: llist_add returns 1 if it was empty before the operation.
>
>   CPU 0                         | CPU 1                      | CPU 2
>   __smp_call_single_q(2,f1)     | __smp_call_single_q(2,f2)  |
>   llist_add returns 1           |                            |
>   interrupted                   | llist_add returns 0        |
>   ...                           | branch not taken           |
>   ...                           |                            |
>   resumed                       |                            |
>   send_call_function_single_ipi |                            |
>                                 |                            | f1
>                                 |                            | f2
>
> The call from CPU 1 for function f2 will be delayed because CPU 0 was
> interrupted.

Do you happen to have any actual numbers and a use-case where this was
relevant?
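For readers unfamiliar with the pattern being discussed: only the enqueue that transitions the list from empty to non-empty sends an IPI, and everyone else piggy-backs on it. A toy model (not kernel code) of that coalescing:

```python
class CallSingleQueue:
    """Toy model of the llist_add()/IPI pattern in __smp_call_single_queue():
    only the enqueuer that finds the list empty sends an IPI; later
    enqueuers rely on that first IPI to get their function run too."""
    def __init__(self):
        self.items = []
        self.ipis_sent = 0

    def smp_call(self, func):
        was_empty = not self.items
        self.items.append(func)        # llist_add()
        if was_empty:                  # llist_add() returned 1
            self.ipis_sent += 1        # send_call_function_single_ipi()

q = CallSingleQueue()
for f in ("f1", "f2", "f3"):
    q.smp_call(f)
print(q.ipis_sent)   # → 1 : f2 and f3 piggy-back on f1's IPI
```

This is exactly why the report matters: if the one task responsible for the single IPI is interrupted before sending it, every piggy-backed caller is delayed along with it.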
Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs
On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote:
> Introduce a new API hk_num_online_cpus(), that can be used to
> retrieve the number of online housekeeping CPUs that are meant to handle
> managed IRQ jobs.
>
> This API is introduced for the drivers that were previously relying only
> on num_online_cpus() to determine the number of MSIX vectors to create.
> In an RT environment with many isolated but few housekeeping CPUs this
> was leading to a situation where an attempt to move all of the vectors
> corresponding to isolated CPUs to housekeeping CPUs was failing due to
> the per-CPU vector limit.
>
> Signed-off-by: Nitesh Narayan Lal
> ---
>  include/linux/sched/isolation.h | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index cc9f393e2a70..2e96b626e02e 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
>  	return true;
>  }
>
> +static inline unsigned int hk_num_online_cpus(void)

This breaks with the established naming of that header.
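Naming aside, the helper boils down to counting the intersection of the online mask and the managed-IRQ housekeeping mask (my paraphrase of the patch description). A sketch of how a driver might use it to cap MSI-X vectors; the CPU counts here are made up:

```python
def hk_num_online_cpus(online_cpus, hk_managed_irq_cpus):
    """Model of the proposed helper: online CPUs that are also
    housekeeping CPUs for managed-IRQ work."""
    return len(online_cpus & hk_managed_irq_cpus)

# Hypothetical RT box: 64 CPUs online, only CPUs 0-3 are housekeeping.
online = set(range(64))
housekeeping = {0, 1, 2, 3}

wanted_msix = len(online)   # what a driver using num_online_cpus() would ask for
capped = min(wanted_msix, hk_num_online_cpus(online, housekeeping))
print(capped)   # → 4 : vectors stay within what housekeeping CPUs can absorb
```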
Re: [PATCH 0/2] measure latency of cpu hotplug path
On Wed, Sep 23, 2020 at 04:37:44PM -0700, Prasad Sodagudi wrote:
> These are all changes related to the cpu hotplug path and we would like
> to seek upstream review. These patches have been in the Qualcomm
> downstream kernel for quite a long time. The first patch sets rt
> priority for the hotplug task and the second patch adds cpuhp trace
> events.
>
> 1) cpu-hotplug: Always use real time scheduling when hotplugging a CPU
> 2) cpu/hotplug: Add cpuhp_latency trace event

Why? Hotplug is a known super slow path. If you care about hotplug
latency you're doing it wrong.
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 10:40:32 +0200
> pet...@infradead.org wrote:
>
> > However, with migrate_disable() we can have each task preempted in a
> > migrate_disable() region, worse we can stack them all on the _same_ CPU
> > (super ridiculous odds, sure). And then we end up only able to run one
> > task, with the rest of the CPUs picking their nose.
>
> What if we just made migrate_disable() a local_lock() available for !RT?

Can't, neither migrate_disable() nor migrate_enable() is allowed to
block -- which is what makes their implementation so 'interesting'.

> This should lower the SHC in theory, if you can't have stacked migrate
> disables on the same CPU.

See this email in that other thread (you're on Cc too IIRC):

  https://lkml.kernel.org/r/20200923170809.gy1362...@hirez.programming.kicks-ass.net

I think that if we 'frob' the balance PULL, we'll end up with something
similar.

Whichever way around we turn this thing, the migrate_disable() runtime
(we'll have to add a tracer for that) will be an interference term on
the lower priority task, exactly like preempt_disable() would be. We'll
just not exclude a higher priority task from running.

AFAICT; the best case is a single migrate_disable() nesting, where a
higher priority task preempts in a migrate_disable() section -- this is
per design.

When this preempted task becomes eligible to run under the ideal model
(IOW it becomes one of the M highest priority tasks), it might still
have to wait for the preemptee's migrate_disable() section to complete.
Thereby suffering interference in the exact duration of the
migrate_disable() section.

Per this argument, the change from preempt_disable() to
migrate_disable() gets us:

 - higher priority tasks gain reduced wake-up latency

 - lower priority tasks are unchanged and are subject to the exact same
   interference term as if the higher priority task were using
   preempt_disable().
Since we've already established this term is unbounded, any task but the
highest priority task is basically buggered.

TL;DR, if we get balancing fixed and achieve (near) the optimal case
above, migrate_disable() is an over-all win. But it's provably
non-deterministic as long as the migrate_disable() sections are
non-deterministic.

The reason this all mostly works in practice is (I think) because:

 - People care most about the higher prio RT tasks and craft them to
   mostly avoid the migrate_disable() infected code.

 - The preemption scenario is most pronounced at higher utilization
   scenarios, and I suspect this is fairly rare to begin with.

 - And while many of these migrate_disable() regions are unbound in
   theory, in practice they're often fairly reasonable.

So my current todo list is:

 - Change RT PULL
 - Change DL PULL
 - Add migrate_disable() tracer; exactly like preempt/irqoff, except
   measuring task-runtime instead of cpu-time.
 - Add a mode that measures actual interference.
 - Add a traceevent to detect preemption in migrate_disable().

And then I suppose I should twist Daniel's arm to update his model to
include these scenarios and numbers.
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 06:44:12AM +0200, Dmitry Vyukov wrote:
> On Thu, Sep 24, 2020 at 6:36 AM Herbert Xu wrote:
> > > (k-slock-AF_INET6){+.-.}-{2:2}

That's a seqlock.

> > What's going on with all these bogus lockdep reports?
> >
> > These are two completely different locks, one is for TCP and the
> > other is for SCTP. Why is lockdep suddenly becoming confused about
> > this?
>
> FWIW this flood of bogus reports started on 16/Sep.
>
> FWIW one of the dups of this issue was bisected to:
>
> commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
> Author: Ahmed S. Darwish
> Date: Fri Sep 4 15:32:31 2020 +
>
>     seqlock: PREEMPT_RT: Do not starve seqlock_t writers
>
> Can it be related?

Did that tree you're testing include 267580db047e ("seqlock: Unbreak
lockdep") ?
Re: [PATCH 7/9] sched: Add migrate_disable()
On Wed, Sep 23, 2020 at 10:31:10AM +0200, Thomas Gleixner wrote:
> In practice migrate disable could be taken into account on placement
> decisions, but yes we don't have anything like that at the moment.

I think at the very least we should do some of that.

The premise is wanting to run the M highest priority tasks; when a CPU
drops priority, it tries to PULL a higher priority task towards itself.

If this PULL selects a migrate_disable() task, it means the task is in
the M highest prio tasks. Since obviously that task cannot get pulled,
we should then pull the current running task of that CPU, which would
then allow the migrate_disable() task to resume execution.

I'll go try and cook up something along those lines...
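The suggested balancing tweak could look roughly like this; pure pseudo-logic in Python, with the task fields and function names invented for illustration (this is not the actual rt.c pull path):

```python
def pick_pull_action(this_cpu, tasks):
    """Sketch of the proposed PULL tweak: if the best pullable task is
    migrate-disabled (so it is one of the M highest-prio tasks but is
    pinned), displace the task currently running on its CPU instead,
    so the pinned task can resume there."""
    candidates = [t for t in tasks
                  if t["cpu"] != this_cpu and not t["running"]]
    if not candidates:
        return None
    best = max(candidates, key=lambda t: t["prio"])
    if not best["migrate_disabled"]:
        return ("pull", best["name"])          # the normal PULL
    # Cannot migrate 'best'; pull whatever runs on its CPU instead.
    curr = next(t for t in tasks if t["cpu"] == best["cpu"] and t["running"])
    return ("pull", curr["name"])

tasks = [
    {"name": "lo", "cpu": 1, "prio": 10, "running": True,  "migrate_disabled": False},
    {"name": "hi", "cpu": 1, "prio": 90, "running": False, "migrate_disabled": True},
]
print(pick_pull_action(0, tasks))   # → ('pull', 'lo') : frees CPU1 for 'hi'
```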
Re: [PATCH] sched/fair: Use bool parameter for update_tg_load_avg()
On Wed, Sep 23, 2020 at 09:29:35PM +0800, Xianting Tian wrote:
> In the file fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used,
> sometimes update_tg_load_avg(cfs_rq, false) is used. So change it
> to use a bool parameter.

afaict it's never true (or my git-grep failed), so why not remove the
argument entirely?
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> Alternatively this could of course be solved with per CPU page tables
> which will come around some day anyway I fear.

Previously (with PTI) we looked at making the entire kernel map per-CPU,
and that takes a 2K copy on switch_mm() (or more generally, the user
part of whatever the top level directory is for architectures that have
a shared kernel/user page-table setup in the first place).

The idea was having a fixed per-cpu kernel page-table, share a bunch of
(kernel) page-tables between all CPUs and then copy in the user part on
switch. I've forgotten what the plan was for ASID/PCID in that scheme.

For x86_64 we've been fearing the performance of that 2k copy, but I
don't think we've ever actually bit the bullet and implemented it to see
how bad it really is.
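For the curious, the 2K figure is just back-of-envelope x86-64 arithmetic: the top-level page table (PGD) is one 4 KiB page of 512 8-byte entries, split evenly between the user and kernel halves, so only the user half would need copying into a per-CPU PGD on switch_mm():

```python
ENTRY_BYTES = 8      # one PGD entry on x86-64
PGD_ENTRIES = 512    # 4 KiB page / 8-byte entries

# Only the user half of the top-level table must be copied per switch_mm().
user_half = (PGD_ENTRIES // 2) * ENTRY_BYTES
print(user_half)     # → 2048 bytes, i.e. the "2K copy" above
```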
Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner wrote:
> >>
> >> If a task is migrated to a different CPU then the mapping address will
> >> change which will explode in colourful ways.
> >
> > Right you are.
> >
> > Maybe we really *could* call this new kmap functionality something
> > like "kmap_percpu()" (or maybe "local" is good enough), and make it
> > act like your RT code does for spinlocks - not disable preemption, but
> > only disabling CPU migration.
>
> I'm all for it, but the scheduler people have opinions :)

Right, so I'm concerned. migrate_disable() wrecks pretty much all
Real-Time scheduler theory we have, and PREEMPT_RT bringing it in is
somewhat ironic.

Yes, it allows breaking up non-preemptible regions of non-deterministic
duration, and thereby both reduces and bounds the scheduling latency,
but the cost of doing that is that the theory on CPU
utilization/bandwidth goes out the window.

To easily see this, consider an SMP system with a number of tasks equal
to the number of CPUs. On a regular (preempt_disable) kernel we can
always run each task, by virtue of always having an idle CPU to take the
task.

However, with migrate_disable() we can have each task preempted in a
migrate_disable() region, worse we can stack them all on the _same_ CPU
(super ridiculous odds, sure). And then we end up only able to run one
task, with the rest of the CPUs picking their nose.

The end result is that, like with unbounded latency, we will still miss
our deadline, simply because we got starved for CPU.

Now, while we could (with a _lot_ of work) rework the kernel to not rely
on the implicit per-cpu ness of things like spinlock_t, the moment we
bring in basic primitives that rely on migrate_disable() we're stuck
with it.

The problem is; afaict there's been no research into this problem.
There might be scheduling (read: balancing) schemes that can
mitigate/solve this problem, or it might prove to be a 'hard' problem,
I just don't know.

But once we start down this road, it's going to be hell to get rid of
it. That's why I've been arguing (for many years) to strictly limit this
to PREEMPT_RT and only as a stop-gap, not a fundamental primitive to
build on.
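The stacking worst case described above is easy to see with a toy model (illustrative only, nothing like the real scheduler): N tasks on N CPUs, each preempted inside a migrate_disable() region on the same CPU, can only make progress one at a time, while N-1 CPUs sit idle.

```python
def usable_parallelism(pinned_cpus, n_cpus):
    """pinned_cpus[i] is the CPU task i is stuck on inside
    migrate_disable(), or None if the task is free to migrate.
    Pinned tasks serialize per CPU; free tasks can take any idle CPU."""
    pinned = {c for c in pinned_cpus if c is not None}
    free = sum(1 for c in pinned_cpus if c is None)
    return min(len(pinned) + free, n_cpus, len(pinned_cpus))

N = 4
print(usable_parallelism([None] * N, N))   # → 4 : preempt_disable() world, all CPUs busy
print(usable_parallelism([0] * N, N))      # → 1 : all stacked on CPU0, 3 CPUs idle
```

The point of the model is only the asymmetry: preempt_disable() costs latency but never concurrency, while stacked migrate_disable() regions can cost concurrency itself.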
Re: [PATCH 7/9] sched: Add migrate_disable()
On Mon, Sep 21, 2020 at 09:16:54PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 18:36, Peter Zijlstra wrote:
> > +/*
> > + * Migrate-Disable and why it is (strongly) undesired.
> > + *
> > + * The premise of the Real-Time schedulers we have on Linux
> > + * (SCHED_FIFO/SCHED_DEADLINE) is that M CPUs can/will run M tasks
> > + * concurrently, provided there are sufficient runnable tasks, also known as
> > + * work-conserving. For instance SCHED_DEADLINE tries to schedule the M
> > + * earliest deadline threads, and SCHED_FIFO the M highest priority threads.
> > + *
> > + * The correctness of various scheduling models depends on this, but it is
> > + * broken by migrate_disable() that doesn't imply preempt_disable(). Where
> > + * preempt_disable() implies an immediate priority ceiling, preemptible
> > + * migrate_disable() allows nesting.
> > + *
> > + * The worst case is that all tasks preempt one another in a migrate_disable()
> > + * region and stack on a single CPU. This then reduces the available bandwidth
> > + * to a single CPU. And since Real-Time schedulability theory considers the
> > + * Worst-Case only, all Real-Time analysis shall revert to single-CPU
> > + * (instantly solving the SMP analysis problem).
>
> I'm telling you for years that SMP is the source of all evils and
> NR_CPUS=0 is the ultimate solution of all problems. Paul surely
> disagrees as he thinks that NR_CPUS<0 is the right thing to do.

Surely imaginary numbers are even better :-)

> But seriously, I completely understand your concern vs. schedulability
> theories, but those theories can neither deal well with preemption
> disable simply because you can create other trainwrecks when enough low
> priority tasks run long enough in preempt disabled regions in
> parallel.

Ah, no, those theories can deal with preemption disable perfectly fine.
The result is an increase in latency.
It so happens we don't like that, but that's our problem :-)

> The scheduler simply does not know ahead how long these
> sections will take and how many of them will run in parallel.

Ah, but the thing is, preempt_disable() does not limit concurrency.
Assuming idle CPUs, the waking task can always go elsewhere.

The thing with migrate_disable() OTOH is that even though there are idle
CPUs, we're actively prohibited from using them.

> The theories make some assumptions about preempt disable and consider it
> as temporary priority ceiling, but that's all assumptions as the bounds
> of these operations are simply unknown.

Sure, that directly translates into unbounded (or rather of
non-deterministic duration) latencies, which are bad for determinism.
But the theory is fairly clear on this.

> > + * The reason we have it anyway.
> > + *
> > + * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
> > + * number of primitives into becoming preemptible, they would also allow
> > + * migration. This turns out to break a bunch of per-cpu usage. To this end,
> > + * all these primitives employ migrate_disable() to restore this implicit
> > + * assumption.
> > + *
> > + * This is a 'temporary' work-around at best. The correct solution is getting
> > + * rid of the above assumptions and reworking the code to employ explicit
> > + * per-cpu locking or short preempt-disable regions.
>
> What timeframe are you envisioning for 'temporary'? I assume something
> which is closer to your retirement than to mine :)

I figured we'd put a WARN on per-cpu usage with only migrate_disable(),
under a Kconfig knob, much like how RCU-lockdep started, once all of
PREEMPT_RT has landed.

Gotta keep busy, right :-)

> > + * The end goal must be to get rid of migrate_disable(), alternatively we need
> > + * a schedulability theory that does not depend on arbitrary migration.
> > Finally something new the academics can twist their brain around :)

I'm sure they've been waiting for more work ;-)

> But as the kmap discussion has shown, the current situation of enforcing
> preempt disable even on a !RT kernel is not pretty either. I looked at
> quite some of the kmap_atomic() usage sites and the resulting
> workarounds for non-preemptability are pretty horrible especially if
> they do copy_from/to_user() or such in those regions. There is tons of
> other code which really only requires migrate disable.

Yes, I'm on that thread, I'll reply there as well, I really hate going
down that path without having a decent understanding of the
ramifications. The more we spread this muck around, the deeper the hole
we dig for ourselves to climb out of.

The thing is, afaik the only theory that 'works' with migrate_disable()
is fully partitioned, but we break that by having cross CPU blocking
chains.
Re: [RESEND PATCH V2 0/6] Support PCIe3 uncore PMU on Snow Ridge
On Mon, Sep 14, 2020 at 07:34:14AM -0700, kan.li...@linux.intel.com wrote: > From: Kan Liang > > Changes since V1: > - Drop the platform device solution > - A new uncore PCI sub driver solution is introduced which searches > the PCIe Root Port device via pci_get_device() and id table. > Register a PCI bus notifier for the remove notification. Once the > device is removed, the uncore driver can be notified to unregister > the corresponding PMU. > - Modify the parameters of uncore_pci_pmu_unregister() function. Bjorn, you hated on the last version of this thing, are you OK with this one?
Re: [PATCH 0/3] drm: commit_work scheduling
On Mon, Sep 21, 2020 at 11:21:54AM +0200, Daniel Vetter wrote: > So question to rt/worker folks: What's the best way to let userspace set > the scheduling mode and priorities of things the kernel does on its > behalf? Surely we're not the first ones where if userspace runs with some > rt priority it'll starve out the kernel workers that it needs. Hardcoding > something behind a subsystem ioctl (which just means every time userspace > changes what it does, we need a new such flag or mode) can't be the right > thing. > > Peter, Tejun? So regular workqueues do not support RT priorities, but you can set their nice value somewhere in /sys. The kthread_work stuff used in these patches result in a regular kthread and as such the user interface for changing its scheduling class or priority is that of any other 'random' task.
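Concretely, for the kthread_work case: since the worker is an ordinary kernel thread, userspace adjusts it with the same syscalls chrt(1) uses on any task. A sketch, run against the calling thread here because operating on a real kthread needs its PID plus CAP_SYS_NICE; SCHED_BATCH is used only because switching to it is unprivileged:

```python
import os

# Inspect and change a task's scheduling policy, as chrt(1) would do
# for a kthread_worker's PID. SCHED_FIFO/SCHED_RR would need privileges;
# SCHED_BATCH does not, so it works as a self-contained demo.
pid = 0  # 0 == calling thread; a real user would pass the kthread's PID
print(os.sched_getscheduler(pid))                     # current policy
os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
print(os.sched_getscheduler(pid) == os.SCHED_BATCH)
```

The same pair of calls with `os.SCHED_FIFO` and a nonzero `sched_param` is what a privileged userspace would issue to give an RT priority to the commit-work kthread discussed above.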
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:48:24PM +0200, Oleg Nesterov wrote:
> Of course, this assumes that atomic_t->counter underflows "correctly", just
> like "unsigned int".

We've documented that we do. Lots of code relies on that.

See Documentation/atomic_t.txt TYPES

> But again, do we really want this?

I like the two counters better, it avoids atomics entirely, and some
archs have horridly expensive atomics (*cough* power *cough*).

I just tried to be clever and use a single u64 load (where possible)
instead of two 32bit loads and got the sum vs split order wrong.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote: > + u64 sum = per_cpu_sum(*(u64 *)sem->read_count); Moo, that doesn't work, we have to do two separate sums. I shouldn't try to be clever on a Friday I suppose :-(
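One way to see why the single u64 sum can't work (my reading of the problem, not spelled out in the thread): summing the per-CPU u64 words lets a carry out of the low 32-bit halves leak into the high halves, which the final fold then miscounts as a reader. Two independent 32-bit sums don't have that problem. A quick demonstration:

```python
MASK32 = (1 << 32) - 1

def folded_u64_sum(per_cpu_words):
    """The (u32)(sum + (sum >> 32)) variant from the thread: sum the
    per-CPU u64 words, then fold high half into low half."""
    s = sum(per_cpu_words) & ((1 << 64) - 1)
    return (s + (s >> 32)) & MASK32

def two_separate_sums(per_cpu_words):
    """Sum each 32-bit half independently (mod 2^32), then combine."""
    lo = sum(w & MASK32 for w in per_cpu_words) & MASK32
    hi = sum(w >> 32 for w in per_cpu_words) & MASK32
    return (lo + hi) & MASK32

# A reader does down_read() on CPU0 and (after migrating) up_read() on
# CPU1: the low halves hold +1 and -1 (mod 2^32). No readers are active.
cpu0 = 1
cpu1 = MASK32            # -1 in the low half, as an unsigned word

print(two_separate_sums([cpu0, cpu1]))  # → 0 : correct, no readers
print(folded_u64_sum([cpu0, cpu1]))     # → 1 : the carry out of the low
                                        #       halves lands in the high
                                        #       bits and the fold counts
                                        #       it as an active reader
```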
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:04:32PM +0200, pet...@infradead.org wrote:
> On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> > @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
> >   */
> >  static bool readers_active_check(struct percpu_rw_semaphore *sem)
> >  {
> > -	if (per_cpu_sum(*sem->read_count) != 0)
> > +	u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> > +
> > +	if (sum + (sum >> 32))
>
> That obviously wants to be:
>
>	if ((u32)(sum + (sum >> 32)))
>
> >  		return false;
> >
> >  	/*

I suppose an alternative way of writing that would be something like:

	union {
		u64 sum;
		struct {
			u32 a, b;
		};
	} var;

	var.sum = per_cpu_sum(*(u64 *)sem->read_count);
	if (var.a + var.b)
		return false;

which is more verbose, but perhaps easier to read.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
>   */
>  static bool readers_active_check(struct percpu_rw_semaphore *sem)
>  {
> -	if (per_cpu_sum(*sem->read_count) != 0)
> +	u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> +
> +	if (sum + (sum >> 32))

That obviously wants to be:

	if ((u32)(sum + (sum >> 32)))

>  		return false;
>
>  	/*
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Fri, Sep 18, 2020 at 11:07:02AM +0200, Jan Kara wrote:
> If people really wanted to avoid irq-safe inc/dec for archs where it is
> more expensive, one idea I had was that we could add 'read_count_in_irq' to
> percpu_rw_semaphore. So callers in normal context would use read_count and
> callers in irq context would use read_count_in_irq. And the writer side
> would sum over both but we don't care about performance of that one much.

That's not a bad idea... something like so I suppose. (completely untested)

---
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5fda40f97fe9..9c847490a86a 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -11,7 +11,7 @@
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
-	unsigned int __percpu	*read_count;
+	u32 __percpu		*read_count;
 	struct rcuwait		writer;
 	wait_queue_head_t	waiters;
 	atomic_t		block;
@@ -60,7 +60,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -74,12 +74,16 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -98,12 +102,16 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, _RET_IP_);

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,12 +123,39 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
 }

+static inline void __percpu_up_read_irqsafe(struct percpu_rw_semaphore *sem)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	if (likely(rcu_sync_is_idle(&sem->rss))) {
+		__this_cpu_dec(sem->read_count[1]);
+	} else {
+		/*
+		 * slowpath; reader will only ever wake a single blocked
+		 * writer.
+		 */
+		smp_mb(); /* B matches C */
+		/*
+		 * In other words, if they see our decrement (presumably to
+		 * aggregate zero, as that is the only time it matters) they
+		 * will also see our critical section.
+		 */
+		__this_cpu_dec(sem->read_count[1]);
+		rcuwait_wake_up(&sem->writer);
+	}
+	local_irq_restore(flags);
+}
+
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a576f3f..00741216a7f6 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -12,7 +12,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
 			const char *name, struct lock_class_key *key)
 {
-	sem->read_count = alloc_percpu(int);
+	sem->read_count = (u32 *)alloc_percpu(u64);
 	if (unlikely(!sem->read_count))
 		return -ENOMEM;

@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);

 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	this_cpu_inc(*sem->read_count);
+	__this_cpu_inc(sem->read_count[0]);

 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(!atomic_read_acquire(&sem->block)))
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Fri, Sep 18, 2020 at 09:00:03AM +0200, Thomas Gleixner wrote:
> >> +void migrate_disable(void)
> >> +{
> >> +	unsigned long flags;
> >> +
> >> +	if (!current->migration_ctrl.disable_cnt) {
> >> +		raw_spin_lock_irqsave(&current->pi_lock, flags);
> >> +		current->migration_ctrl.disable_cnt++;
> >> +		raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> >> +	} else {
> >> +		current->migration_ctrl.disable_cnt++;
> >> +	}
> >> +}
> >
> > That pi_lock seems unfortunate, and it isn't obvious what the point of
> > it is.
>
> Indeed. That obviously lacks a big fat comment.
>
> current->migration_ctrl.disable_cnt++ is obviously a RMW operation. So
> you end up with the following:
>
>     CPU0                                        CPU1
>     migrate_disable()
>       R = current->migration_ctrl.disable_cnt;
>                                                 set_cpus_allowed_ptr()
>                                                   task_rq_lock();
>                                                   if (!p->migration_ctrl.disable_cnt) {
>       current->migration_ctrl.disable_cnt = R + 1;
>                                                   stop_one_cpu();
>                                                 ---> stopper_thread()
>                                                        BUG_ON(task->migration_ctrl.disable_cnt);
>
> I tried to back out from that instead of BUG(), but that ended up being
> even more of a hacky trainwreck than just biting the bullet and taking
> pi_lock.

You don't need the load-store for that I think, pure timing will do.

Blergh, lemme prepare more wake-up juice and think about that.
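The interleaving Thomas describes can be replayed step by step in a toy model (illustrative only; a plain variable stands in for current->migration_ctrl.disable_cnt and the stopper's BUG_ON):

```python
def replay_race():
    """Replays the disable_cnt RMW race: CPU1 reads the counter, CPU0's
    set_cpus_allowed_ptr() observes it still zero and queues the stopper,
    then CPU1's delayed write lands and makes the stopper's BUG_ON
    condition true."""
    disable_cnt = 0

    # CPU1: migrate_disable() reads the counter (R = disable_cnt)
    R = disable_cnt
    # CPU0: set_cpus_allowed_ptr() -> task_rq_lock(); sees no
    # migrate-disable in effect, so it queues the migration stopper
    stopper_queued = (disable_cnt == 0)
    # CPU1: the delayed write half of the RMW lands
    disable_cnt = R + 1
    # CPU0's stopper thread now runs: BUG_ON(disable_cnt)
    return stopper_queued and disable_cnt != 0

print(replay_race())   # → True : the unlocked RMW lets the BUG_ON fire
```

Taking pi_lock around the 0 -> 1 transition serializes the RMW against the task_rq_lock()/pi_lock section in set_cpus_allowed_ptr(), which is exactly the window the replay exploits.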
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 06:30:01PM +0200, Sebastian Siewior wrote:
> On 2020-09-17 17:54:10 [+0200], pet...@infradead.org wrote:
> > I'm not sure what the problem with FPU was, I was throwing alternatives
> > at tglx to see what would stick, in part to (re)discover the design
> > constraints of this thing.
>
> was this recent or distant in the time line?

The past few weeks :-) Thomas and me have been bickering about this
stuff on IRC on and off.

> > One reason for not allowing migrate_disable() to sleep was: FPU code.
> >
> > Could it be it does something like:
> >
> >	preempt_disable();
> >	spin_lock();
> >
> >	spin_unlock();
> >	preempt_enable();
> >
> > Where we'll never get preempted while migrate_disable()'d and thus never
> > trigger any of the sleep paths?
>
> I try to get rid of something like that. This doesn't work either way
> because the spin_lock() may block which it can't with disabled
> preemption.

Yeah, that obviously should have been migrate_disable/enable instead of
spin_lock/unlock :/

> Ah. We used to have migrate_disable() in pagefault_disable(). The x86
> FPU code does
>
>	preempt_disable();
>	…
>	pagefault_disable();
>
> but that migrate_disable() was moved from pagefault_disable() to
> kmap_atomic(). We shouldn't have
>
>	preempt_disable(); || local_irq_disable();
>	kmap_atomic();
>
> on RT. I've been running around removing those. See
>
>    a10dcebacdb0c ("fs/ntfs/aops.c: don't disable interrupts during kmap_atomic()")
>    ce1e518190ea7 ("ide: don't disable interrupts during kmap_atomic()")
>    f3a1075e5fc34 ("block: don't disable interrupts during kmap_atomic()")

Hmm, okay.
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 05:13:41PM +0200, Sebastian Siewior wrote:
> On 2020-09-17 16:49:37 [+0200], pet...@infradead.org wrote:
> > I'm aware of the duct-tape :-) But I was under the impression that we
> > didn't want the duct-tape, and that there was lots of issues with the
> > FPU code, or was that another issue?
>
> Of course it would be better not to need the duct tape.
> Also symmetrical locking is what you want but clearly futex is one of
> a kind.
>
> I'm currently not aware of any issues in the FPU code in regard to this.
> A few weeks ago, I was looking for this kind of usage and only futex
> popped up.

I'm not sure what the problem with FPU was, I was throwing alternatives
at tglx to see what would stick, in part to (re)discover the design
constraints of this thing.

One reason for not allowing migrate_disable() to sleep was: FPU code.

Could it be it does something like:

	preempt_disable();
	spin_lock();

	spin_unlock();
	preempt_enable();

Where we'll never get preempted while migrate_disable()'d and thus never
trigger any of the sleep paths?
Re: [PATCH net-next 6/7] lockdep: provide dummy forward declaration of *_is_held() helpers
On Wed, Sep 16, 2020 at 11:45:27AM -0700, Jakub Kicinski wrote:
> When CONFIG_LOCKDEP is not set, lock_is_held() and lockdep_is_held()
> are not declared or defined. This forces all callers to use ifdefs
> around these checks.
>
> Recent RCU changes added a lot of lockdep_is_held() calls inside
> rcu_dereference_protected(). rcu_dereference_protected() hides
> its argument on !LOCKDEP builds, but this may lead to unused variable
> warnings.
>
> Provide forward declarations of lock_is_held() and lockdep_is_held()
> but never define them. This way callers can keep them visible to
> the compiler on !LOCKDEP builds and instead depend on dead code
> elimination to remove the references before the linker barfs.
>
> We need lock_is_held() for RCU.
>
> Signed-off-by: Jakub Kicinski
> --
> CC: pet...@infradead.org
> CC: mi...@redhat.com
> CC: w...@kernel.org
> ---
>  include/linux/lockdep.h | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index 6a584b3e5c74..6b5bbc536bf6 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -371,6 +371,12 @@ static inline void lockdep_unregister_key(struct lock_class_key *key)
>
>  #define lockdep_depth(tsk)	(0)
>
> +/*
> + * Dummy forward declarations, allow users to write less ifdef-y code
> + * and depend on dead code elimination.
> + */
> +int lock_is_held(const void *);

	extern int lock_is_held(const struct lockdep_map *);

> +int lockdep_is_held(const void *);

	extern

I suppose we can't pull the lockdep_is_held() definition out from under
CONFIG_LOCKDEP because it does the ->dep_map dereference and many types
will simply not have that member.

>  #define lockdep_is_held_type(l, r)		(1)
>
>  #define lockdep_assert_held(l)	do { (void)(l); } while (0)
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 04:38:50PM +0200, Sebastian Siewior wrote: > On 2020-09-17 16:24:38 [+0200], pet...@infradead.org wrote: > > And if I'm not mistaken, the above migrate_enable() *does* require being > > able to schedule, and our favourite piece of futex: > > > > raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); > > spin_unlock(q.lock_ptr); > > > > is broken. Consider that spin_unlock() doing migrate_enable() with a > > pending sched_setaffinity(). > > There are two instances of the above and only in the futex code and we > have sort of duct tape for that by manually balancing the migrate > counter so that it does not come to this. > But yes, not having to do the manual balance is a plus. I'm aware of the duct-tape :-) But I was under the impression that we didn't want the duct-tape, and that there was lots of issues with the FPU code, or was that another issue?
Re: [patch 09/10] sched/core: Add migrate_disable/enable()
On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote: > +static inline void update_nr_migratory(struct task_struct *p, long delta) > +{ > + if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory) > + p->sched_class->update_migratory(p, delta); > +} Right, so as you know, I totally hate this thing :-) It adds a second (and radically different) version of changing affinity. I'm working on a version that uses the normal *set_cpus_allowed*() interface. > +/* > + * The migrate_disable/enable() fastpath updates only the tasks migrate > + * disable count which is sufficient as long as the task stays on the CPU. > + * > + * When a migrate disabled task is scheduled out it can become subject to > + * load balancing. To prevent this, update task::cpus_ptr to point to the > + * current CPUs cpumask and set task::nr_cpus_allowed to 1. > + * > + * If task::cpus_ptr does not point to task::cpus_mask then the update has > + * been done already. This check is also used in in migrate_enable() as an > + * indicator to restore task::cpus_ptr to point to task::cpus_mask > + */ > +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu) > +{ > + if (!prev->migration_ctrl.disable_cnt || > + prev->cpus_ptr != &prev->cpus_mask) > + return; > + > + prev->cpus_ptr = cpumask_of(cpu); > + update_nr_migratory(prev, -1); > + prev->nr_cpus_allowed = 1; > +} So this thing is called from schedule(), with only rq->lock held, and that violates the locking rules for changing the affinity. I have a comment that explains how it's broken and why it's sort-of working. > +void migrate_disable(void) > +{ > + unsigned long flags; > + > + if (!current->migration_ctrl.disable_cnt) { > + raw_spin_lock_irqsave(¤t->pi_lock, flags); > + current->migration_ctrl.disable_cnt++; > + raw_spin_unlock_irqrestore(¤t->pi_lock, flags); > + } else { > + current->migration_ctrl.disable_cnt++; > + } > +} That pi_lock seems unfortunate, and it isn't obvious what the point of it is. 
> +void migrate_enable(void) > +{ > + struct task_migrate_data *pending; > + struct task_struct *p = current; > + struct rq_flags rf; > + struct rq *rq; > + > + if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0)) > + return; > + > + if (p->migration_ctrl.disable_cnt > 1) { > + p->migration_ctrl.disable_cnt--; > + return; > + } > + > + raw_spin_lock_irqsave(&p->pi_lock, rf.flags); > + p->migration_ctrl.disable_cnt = 0; > + pending = p->migration_ctrl.pending; > + p->migration_ctrl.pending = NULL; > + > + /* > + * If the task was never scheduled out while in the migrate > + * disabled region and there is no migration request pending, > + * return. > + */ > + if (!pending && p->cpus_ptr == &p->cpus_mask) { > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); > + return; > + } > + > + rq = __task_rq_lock(p, &rf); > + /* Was it scheduled out while in a migrate disabled region? */ > + if (p->cpus_ptr != &p->cpus_mask) { > + /* Restore the tasks CPU mask and update the weight */ > + p->cpus_ptr = &p->cpus_mask; > + p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask); > + update_nr_migratory(p, 1); > + } > + > + /* If no migration request is pending, no further action required. */ > + if (!pending) { > + task_rq_unlock(rq, p, &rf); > + return; > + } > + > + /* Migrate self to the requested target */ > + pending->res = set_cpus_allowed_ptr_locked(p, pending->mask, > +pending->check, rq, &rf); > + complete(pending->done); > +} So, what I'm missing with all this are the design contraints for this trainwreck. Because the 'sane' solution was having migrate_disable() imply cpus_read_lock(). But that didn't fly because we can't have migrate_disable() / migrate_enable() schedule for raisins. And if I'm not mistaken, the above migrate_enable() *does* require being able to schedule, and our favourite piece of futex: raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); spin_unlock(q.lock_ptr); is broken. 
Consider that spin_unlock() doing migrate_enable() with a pending sched_setaffinity(). Let me ponder this more..
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Thu, Sep 17, 2020 at 01:48:38PM +0100, Matthew Wilcox wrote: > On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote: > > IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even > > before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this > > doesn't matter... > > > > Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid > > file_end_write() in IRQ context, but I am not sure it's worth the trouble. > > If we change bio_endio to invoke the ->bi_end_io callbacks in softirq > context instead of hardirq context, we can change the pagecache to take SoftIRQ context has exactly the same problem vs __this_cpu*().
Re: [RFC][PATCH 1/4] acpi: Use CPUIDLE_FLAG_TIMER_STOP
On Tue, Sep 15, 2020 at 06:26:52PM +0200, Rafael J. Wysocki wrote: > On Tue, Sep 15, 2020 at 12:44 PM Peter Zijlstra wrote: > > > > Make acpi_processor_idle use the common broadcast code, there's no > > reason not to. This also removes some RCU usage after > > rcu_idle_enter(). > > > > Signed-off-by: Peter Zijlstra (Intel) > > The whole series looks good to me, so please feel free to add > > Acked-by: Rafael J. Wysocki > > to all of the four patches. > > Alternatively, please let me know if you want me to take the patches. Feel free to take them. All the prerequisite borkage is in linus' tree already.
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Wed, Sep 16, 2020 at 03:58:17PM +0200, Sebastian Andrzej Siewior wrote: > On 2020-09-16 14:10:20 [+0200], pet...@infradead.org wrote: > > squeeze that in please: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a4fe22b8b8418..bed3cd28af578 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6866,7 +6866,7 @@ static int __balance_push_cpu_stop(void *arg) > raw_spin_lock_irq(&p->pi_lock); > rq_lock(rq, &rf); > > - update_rq_clock(); > + update_rq_clock(rq); > > if (task_rq(p) == rq && task_on_rq_queued(p)) { > cpu = select_fallback_rq(rq->cpu, p); > > > and count me in :) Duh... /me goes in search of the brown paper bag -- again!
Re: [PATCHv2] perf: Fix race in perf_mmap_close function
On Wed, Sep 16, 2020 at 01:53:11PM +0200, Jiri Olsa wrote: > There's a possible race in perf_mmap_close when checking ring buffer's > mmap_count refcount value. The problem is that the mmap_count check is > not atomic because we call atomic_dec and atomic_read separately. > > perf_mmap_close: > ... >atomic_dec(&rb->mmap_count); >... >if (atomic_read(&rb->mmap_count)) > goto out_put; > > >free_uid > > out_put: > ring_buffer_put(rb); /* could be last */ > > The race can happen when we have two (or more) events sharing same ring > buffer and they go through atomic_dec and then they both see 0 as refcount > value later in atomic_read. Then both will go on and execute code which > is meant to be run just once. The trivial case should be protected by mmap_sem, we call vm_ops->close() with mmap_sem held for writing IIRC. But yes, I think it's possible to construct a failure here.
Re: [PATCH 16/26] perf tools: Synthesize modules with mmap3
On Wed, Sep 16, 2020 at 12:10:21PM -0300, Arnaldo Carvalho de Melo wrote: > Em Wed, Sep 16, 2020 at 04:17:00PM +0200, pet...@infradead.org escreveu: > > On Wed, Sep 16, 2020 at 11:07:44AM -0300, Arnaldo Carvalho de Melo wrote: > > > Em Wed, Sep 16, 2020 at 10:20:18AM +0200, Jiri Olsa escreveu: > > > > > > > IIRC BUILD_ID_SIZE is 20 bytes which is the correct size for SHA-1. A > > > > > build ID may be 128-bits (16 bytes) if md5 or uuid hashes are used. > > > > > Should this test just be "> 0" ? > > > > > > > > ah right, will check on that > > > > > > And how do you deal with this in the kernel? I.e. to inform userspace, > > > via the PERF_RECORD_MMAP3 (or MMAP2 with that misc bit trick) the size > > > of the build-id? > > > > The union size is 24 bytes, so there's plenty space to store a length > > field with the buildid. > > So, I think we should instead use a bit in the misc field, stating the > kind of build-id, so that we don't waste a byte for that, I think. There's no wastage: u32 min, maj; u64 ino; u64 ino_generation; is 24 bytes, buildid is 20 bytes, that leaves us 4 bytes to encode the buildid type without growing anything.
Re: [PATCH v2] lockdep: Allow tuning tracing capacity constants.
On Wed, Sep 16, 2020 at 01:28:19PM +0200, Dmitry Vyukov wrote: > On Fri, Sep 4, 2020 at 6:05 PM Tetsuo Handa > wrote: > > > > Hello. Can we apply this patch? > > > > This patch addresses top crashers for syzbot, and applying this patch > > will help utilizing syzbot's resource for finding other bugs. > > Acked-by: Dmitry Vyukov > > Peter, do you still have concerns with this? Yeah, I still hate it with a passion; it discourages thinking. A bad annotation that blows up the lockdep storage, no worries, we'll just increase this :/ IIRC the issue with syzbot is that the current sysfs annotation is pretty terrible and generates a gazillion classes, and syzbot likes poking at /sys a lot and thus floods the system. I don't know enough about sysfs to suggest an alternative, and haven't exactly had spare time to look into it either :/ An example of a bad annotation is giving every CPU a separate class, which leads to nr_cpus! chains if CPUs arbitrarily nest (nr_cpus^2 if there's only a single nesting level).
Re: [PATCH 16/26] perf tools: Synthesize modules with mmap3
On Wed, Sep 16, 2020 at 11:07:44AM -0300, Arnaldo Carvalho de Melo wrote: > Em Wed, Sep 16, 2020 at 10:20:18AM +0200, Jiri Olsa escreveu: > > > IIRC BUILD_ID_SIZE is 20 bytes which is the correct size for SHA-1. A > > > build ID may be 128-bits (16 bytes) if md5 or uuid hashes are used. > > > Should this test just be "> 0" ? > > > > ah right, will check on that > > And how do you deal with this in the kernel? I.e. to inform userspace, > via the PERF_RECORD_MMAP3 (or MMAP2 with that misc bit trick) the size > of the build-id? The union size is 24 bytes, so there's plenty space to store a length field with the buildid.
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Wed, Sep 16, 2020 at 12:18:45PM +0200, Sebastian Andrzej Siewior wrote: > With this on top of -rc5 I get: > > [ 42.678670] process 1816 (hackbench) no longer affine to cpu2 > [ 42.678684] process 1817 (hackbench) no longer affine to cpu2 > [ 42.710502] [ cut here ] > [ 42.711505] rq->clock_update_flags < RQCF_ACT_SKIP > [ 42.711514] WARNING: CPU: 2 PID: 19 at kernel/sched/sched.h:1141 > update_curr+0xe3/0x3f0 > [ 42.736635] select_fallback_rq+0x129/0x160 > [ 42.737450] __balance_push_stop+0x132/0x230 Duh.. I wonder why I didn't see that.. anyway, I think the below version ought to cure that. > That comes due to DEQUEUE_NOCLOCK in do_set_cpus_allowed(). And then > there is this: > > [ 42.752454] == > [ 42.752455] WARNING: possible circular locking dependency detected > [ 42.752455] 5.9.0-rc5+ #107 Not tainted > [ 42.752455] -- > [ 42.752456] migration/2/19 is trying to acquire lock: > [ 42.752456] 824639f8 ((console_sem).lock){-.-.}-{2:2}, at: > down_trylock+0xa/0x30 > [ 42.752458] but task is already holding lock: > [ 42.752458] 888276ca9598 (&rq->lock){-.-.}-{2:2}, at: > __balance_push_stop+0x50/0x230 > [ 42.752460] which lock already depends on the new lock. Yeah, that's the known issue with printk()... --- Subject: sched/hotplug: Ensure only per-cpu kthreads run during hotplug From: Peter Zijlstra Date: Fri Sep 11 09:54:27 CEST 2020 In preparation for migrate_disable(), make sure only per-cpu kthreads are allowed to run on !active CPUs. This is ran (as one of the very first steps) from the cpu-hotplug task which is a per-cpu kthread and completion of the hotplug operation only requires such tasks. This constraint enables the migrate_disable() implementation to wait for completion of all migrate_disable regions on this CPU at hotplug time without fear of any new ones starting. This replaces the unlikely(rq->balance_callbacks) test at the tail of context_switch with an unlikely(rq->balance_work), the fast path is not affected. 
Signed-off-by: Peter Zijlstra (Intel) --- kernel/sched/core.c | 114 ++- kernel/sched/sched.h |5 ++ 2 files changed, 117 insertions(+), 2 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3513,8 +3513,10 @@ static inline struct callback_head *spli struct callback_head *head = rq->balance_callback; lockdep_assert_held(&rq->lock); - if (head) + if (head) { rq->balance_callback = NULL; + rq->balance_flags &= ~BALANCE_WORK; + } return head; } @@ -3535,6 +3537,22 @@ static inline void balance_callbacks(str } } +static bool balance_push(struct rq *rq); + +static inline void balance_switch(struct rq *rq) +{ + if (unlikely(rq->balance_flags)) { + /* +* Run the balance_callbacks, except on hotplug +* when we need to push the current task away. +*/ + if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || + !(rq->balance_flags & BALANCE_PUSH) || + !balance_push(rq)) + __balance_callbacks(rq); + } +} + #else static inline void __balance_callbacks(struct rq *rq) @@ -3550,6 +3568,10 @@ static inline void balance_callbacks(str { } +static inline void balance_switch(struct rq *rq) +{ +} + #endif static inline void @@ -3577,7 +3599,7 @@ static inline void finish_lock_switch(st * prev into current: */ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); - __balance_callbacks(rq); + balance_switch(rq); raw_spin_unlock_irq(&rq->lock); } @@ -6836,6 +6858,89 @@ static void migrate_tasks(struct rq *dea rq->stop = stop; } + +static int __balance_push_cpu_stop(void *arg) +{ + struct task_struct *p = arg; + struct rq *rq = this_rq(); + struct rq_flags rf; + int cpu; + + raw_spin_lock_irq(&p->pi_lock); + rq_lock(rq, &rf); + + update_rq_clock(); + + if (task_rq(p) == rq && task_on_rq_queued(p)) { + cpu = select_fallback_rq(rq->cpu, p); + rq = __migrate_task(rq, &rf, p, cpu); + } + + rq_unlock(rq, &rf); + raw_spin_unlock_irq(&p->pi_lock); + + put_task_struct(p); + + return 0; +} + +static DEFINE_PER_CPU(struct cpu_stop_work, push_work); + +/* + * Ensure we only run per-cpu kthreads once the CPU 
goes !active. + */ +static bool balance_push(struct rq *rq) +{ + struct task_struct *push_task = rq->curr; + + lockdep_assert_held(&rq->lock); + SCHED_WARN_ON(rq->cpu != smp_processor_id()); + + /* +* Both the cpu-hotplug and stop task are in this case and are +* required to complete the hotplug process. +*/ + if (is
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote: > I have simply test the performance impact on both x86 and aarch64. > > There is no degradation under x86 (2 sockets, 18 core per sockets, 2 threads > per core) Yeah, x86 is magical here, it's the same single instruction for both ;-) But it is, afaik, unique in this position, no other arch can pull that off. > However the performance degradation is huge under aarch64 (4 sockets, 24 core > per sockets): nearly 60% lost. > > v4.19.111 > no writer, reader cn | 24| 48| > 72| 96 > the rate of down_read/up_read per second | 166129572 | 166064100 | > 165963448 | 165203565 > the rate of down_read/up_read per second (patched) | 63863506 | 63842132 | > 63757267 | 63514920 Teh hurt :/
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Wed, Sep 16, 2020 at 08:52:07AM -0400, Qian Cai wrote: > On Tue, 2020-09-15 at 16:30 +0200, pet...@infradead.org wrote: > > On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote: > > > I think this happened because seqcount_##lockname##_init() is defined at > > > function rather than macro, so when the seqcount_init() gets expand in > > > > Bah! I hate all this :/ > > > > I suspect the below, while more verbose than I'd like is the best > > option. > > Stephen, can you add this patch for now until Peter beats you to it? Did you verify it works? I only wrote it..
Re: [PATCHv2] perf: Fix race in perf_mmap_close function
On Wed, Sep 16, 2020 at 01:53:11PM +0200, Jiri Olsa wrote: > There's a possible race in perf_mmap_close when checking ring buffer's > mmap_count refcount value. The problem is that the mmap_count check is > not atomic because we call atomic_dec and atomic_read separately. > > perf_mmap_close: > ... >atomic_dec(&rb->mmap_count); >... >if (atomic_read(&rb->mmap_count)) > goto out_put; > > >free_uid > > out_put: > ring_buffer_put(rb); /* could be last */ > > The race can happen when we have two (or more) events sharing same ring > buffer and they go through atomic_dec and then they both see 0 as refcount > value later in atomic_read. Then both will go on and execute code which > is meant to be run just once. > > The code that detaches ring buffer is probably fine to be executed more > than once, but the problem is in calling free_uid, which will later on > demonstrate in related crashes and refcount warnings, like: > > refcount_t: addition on 0; use-after-free. > ... > RIP: 0010:refcount_warn_saturate+0x6d/0xf > ... > Call Trace: > prepare_creds+0x190/0x1e0 > copy_creds+0x35/0x172 > copy_process+0x471/0x1a80 > _do_fork+0x83/0x3a0 > __do_sys_wait4+0x83/0x90 > __do_sys_clone+0x85/0xa0 > do_syscall_64+0x5b/0x1e0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Using atomic decrease and check instead of separated calls. > This fixes CVE-2020-14351. I'm tempted to remove that line; I can never seem to find anything useful in a CVE. Fixes: ? > Acked-by: Namhyung Kim > Tested-by: Michael Petlan > Signed-off-by: Jiri Olsa
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Wed, Sep 16, 2020 at 09:00:59AM -0400, Qian Cai wrote: > > > - Original Message - > > On Wed, Sep 16, 2020 at 08:52:07AM -0400, Qian Cai wrote: > > > On Tue, 2020-09-15 at 16:30 +0200, pet...@infradead.org wrote: > > > > On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote: > > > > > I think this happened because seqcount_##lockname##_init() is defined > > > > > at > > > > > function rather than macro, so when the seqcount_init() gets expand in > > > > > > > > Bah! I hate all this :/ > > > > > > > > I suspect the below, while more verbose than I'd like is the best > > > > option. > > > > > > Stephen, can you add this patch for now until Peter beats you to it? > > > > Did you verify it works? I only wrote it.. > > Yes, I did. Excellent, I'll stick a Tested-by from you on then.
Re: [tip:x86/seves] BUILD SUCCESS WITH WARNING e6eb15c9ba3165698488ae5c34920eea20eaa38e
On Wed, Sep 16, 2020 at 10:46:41AM +0200, Marco Elver wrote: > On Wed, 16 Sep 2020 at 10:30, wrote: > > On Tue, Sep 15, 2020 at 08:09:16PM +0200, Marco Elver wrote: > > > On Tue, 15 Sep 2020 at 19:40, Nick Desaulniers > > > wrote: > > > > On Tue, Sep 15, 2020 at 10:21 AM Borislav Petkov wrote: > > > > > > > init/calibrate.o: warning: objtool: asan.module_ctor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/calibrate.o: warning: objtool: asan.module_dtor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/version.o: warning: objtool: asan.module_ctor()+0xc: call > > > > > without frame pointer save/setup > > > > > init/version.o: warning: objtool: asan.module_dtor()+0xc: call > > > > > without frame pointer save/setup > > > > > certs/system_keyring.o: warning: objtool: asan.module_ctor()+0xc: > > > > > call without frame pointer save/setup > > > > > certs/system_keyring.o: warning: objtool: asan.module_dtor()+0xc: > > > > > call without frame pointer save/setup > > > > > > This one also appears with Clang 11. This is new I think because we > > > started emitting ASAN ctors for globals redzone initialization. > > > > > > I think we really do not care about precise stack frames in these > > > compiler-generated functions. So, would it be reasonable to make > > > objtool ignore all *san.module_ctor and *san.module_dtor functions (we > > > have them for ASAN, TSAN, MSAN)? > > > > The thing is, if objtool cannot follow, it cannot generate ORC data and > > our unwinder cannot unwind through the instrumentation, and that is a > > fail. > > > > Or am I missing something here? > > They aren't about the actual instrumentation. The warnings are about > module_ctor/module_dtor functions which are compiler-generated, and > these are only called on initialization/destruction (dtors only for > modules I guess). > > E.g. for KASAN it's the calls to __asan_register_globals that are > called from asan.module_ctor. 
For KCSAN the tsan.module_ctor is > effectively a noop (because __tsan_init() is a noop), so it really > doesn't matter much. > > Is my assumption correct that the only effect would be if something > called by them fails, we just don't see the full stack trace? I think > we can live with that, there are only few central places that deal > with ctors/dtors (do_ctors(), ...?). Not only fails, lockdep for example likes to store stack traces of various callsites etc.. Also perf (NMI) likes to think it can unwind at all times. > The "real" fix would be to teach the compilers about "frame pointer > save/setup" for generated functions, but I don't think that's > realistic. How is that unrealistic? If you build with framepointers enabled, the compiler is supposed to know about this stuff.
Re: [tip:x86/seves] BUILD SUCCESS WITH WARNING e6eb15c9ba3165698488ae5c34920eea20eaa38e
On Tue, Sep 15, 2020 at 08:09:16PM +0200, Marco Elver wrote: > On Tue, 15 Sep 2020 at 19:40, Nick Desaulniers > wrote: > > On Tue, Sep 15, 2020 at 10:21 AM Borislav Petkov wrote: > > > init/calibrate.o: warning: objtool: asan.module_ctor()+0xc: call without > > > frame pointer save/setup > > > init/calibrate.o: warning: objtool: asan.module_dtor()+0xc: call without > > > frame pointer save/setup > > > init/version.o: warning: objtool: asan.module_ctor()+0xc: call without > > > frame pointer save/setup > > > init/version.o: warning: objtool: asan.module_dtor()+0xc: call without > > > frame pointer save/setup > > > certs/system_keyring.o: warning: objtool: asan.module_ctor()+0xc: call > > > without frame pointer save/setup > > > certs/system_keyring.o: warning: objtool: asan.module_dtor()+0xc: call > > > without frame pointer save/setup > > This one also appears with Clang 11. This is new I think because we > started emitting ASAN ctors for globals redzone initialization. > > I think we really do not care about precise stack frames in these > compiler-generated functions. So, would it be reasonable to make > objtool ignore all *san.module_ctor and *san.module_dtor functions (we > have them for ASAN, TSAN, MSAN)? The thing is, if objtool cannot follow, it cannot generate ORC data and our unwinder cannot unwind through the instrumentation, and that is a fail. Or am I missing something here?
Re: [PATCH] arch: x86: power: cpu: init %gs before __restore_processor_state (clang)
On Tue, Sep 15, 2020 at 12:51:47PM -0700, Nick Desaulniers wrote: > It would be much nicer if we had the flexibility to disable stack > protectors per function, rather than per translation unit. I'm going > to encourage you to encourage your favorite compile vendor ("write to > your senator") to support the function attribute > __attribute__((no_stack_protector)) so that one day, we can use that > to stop shipping crap like a9a3ed1eff360 ("x86: Fix early boot crash > on gcc-10, third try"). I think we were all in favour of having that, not sure why it hasn't happened yet. More important matters I suppose :/
Re: [PATCH] stats: Replace seq_printf with seq_puts
On Wed, Sep 16, 2020 at 03:44:15AM +, Xu Wang wrote: > seq_puts is a lot cheaper than seq_printf, so use that to print > literal strings. performance is not a consideration here. > Signed-off-by: Xu Wang > --- > kernel/sched/stats.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c > index 750fb3c67eed..0818fe03407a 100644 > --- a/kernel/sched/stats.c > +++ b/kernel/sched/stats.c > @@ -37,7 +37,7 @@ static int show_schedstat(struct seq_file *seq, void *v) > rq->rq_cpu_time, > rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount); > > - seq_printf(seq, "\n"); > + seq_putc(seq, '\n'); And yet, the changelog says seq_put*s*(). > > #ifdef CONFIG_SMP > /* domain-specific stats */ > -- > 2.17.1 >
Re: [RFC] autonuma: Migrate on fault among multiple bound nodes
On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote: > So in this patch, if MPOL_BIND is used to bind the memory of the > application to multiple nodes, and in the hint page fault handler both > the faulting page node and the accessing node are in the policy > nodemask, the page will be tried to be migrated to the accessing node > to reduce the cross-node accessing. Seems fair enough.. > Questions: > > Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA > optimizing globally. And now, it appears that the explicit NUMA > memory policy specifying (e.g. via numactl, mbind(), etc.) acts like > an implicit per-thread/VMA knob to enable/disable the AutoNUMA > optimizing for the thread/VMA. Although this looks like a side effect > instead of an API, from commit fc3147245d19 ("mm: numa: Limit NUMA > scanning to migrate-on-fault VMAs"), this is used by some users? So > the question is, do we need an explicit per-thread/VMA knob to > enable/disable AutoNUMA optimizing for the thread/VMA? Or just use > the global knob, either optimize all thread/VMAs as long as the > explicitly specified memory policies are respected, or don't optimize > at all. I don't understand the question; that commit is not about disabling numa balancing, it's about avoiding pointless work and overhead. What's the point of scanning memory if you're not going to be allowed to move it anyway. 
> Signed-off-by: "Huang, Ying" > Cc: Andrew Morton > Cc: Ingo Molnar > Cc: Mel Gorman > Cc: Rik van Riel > Cc: Johannes Weiner > Cc: "Matthew Wilcox (Oracle)" > Cc: Dave Hansen > Cc: Andi Kleen > Cc: Michal Hocko > Cc: David Rientjes > --- > mm/mempolicy.c | 43 +++ > 1 file changed, 31 insertions(+), 12 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index eddbe4e56c73..a941eab2de24 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1827,6 +1827,13 @@ static struct mempolicy *get_vma_policy(struct > vm_area_struct *vma, > return pol; > } > > +static bool mpol_may_mof(struct mempolicy *pol) > +{ > + /* May migrate among bound nodes for MPOL_BIND */ > + return pol->flags & MPOL_F_MOF || > + (pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1); > +} This is weird, why not just set F_MOF on the policy? In fact, why wouldn't something like: mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY); work today? Afaict MF_LAZY will unconditionally result in M_MOF. > @@ -2494,20 +2503,30 @@ int mpol_misplaced(struct page *page, struct > vm_area_struct *vma, unsigned long > break; > > case MPOL_BIND: > /* > + * Allows binding to multiple nodes. If both current and > + * accessing nodes are in policy nodemask, migrate to > + * accessing node to optimize page placement. Otherwise, > + * use current page if in policy nodemask or MPOL_F_MOF not > + * set, else select nearest allowed node, if any. If no > + * allowed nodes, use current [!misplaced]. >*/ > + if (node_isset(curnid, pol->v.nodes)) { > + if (node_isset(thisnid, pol->v.nodes)) { > + moron = true; > + polnid = thisnid; > + } else { > + goto out; > + } > + } else if (!(pol->flags & MPOL_F_MOF)) { > goto out; > + } else { > + z = first_zones_zonelist( > node_zonelist(numa_node_id(), GFP_HIGHUSER), > gfp_zone(GFP_HIGHUSER), > &pol->v.nodes); > + polnid = zone_to_nid(z->zone); > + } > break; > > default: Did that want to be this instead? I don't think I follow the other changes. 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eddbe4e56c73..2a64913f9ac6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2501,8 +2501,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long * else select nearest allowed node, if any. * If no allowed nodes, use current [!misplaced]. */ - if (node_isset(curnid, pol->v.nodes)) + if (node_isset(curnid, pol->v.nodes)) { + if (node_isset(thisnid, pol->v.nodes)) + goto moron; goto out; + } z = first_zones_zonelist( node_zonelist(numa_node_id(), GFP_HIGHUSER), gfp_zone(GFP_HIGHUSER), @@ -2516,6 +2519,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long /* Migrate the page towards the node whose CPU is referencing it */ if (pol->flags & MPOL_
Re: [PATCH] x86/smap: Fix the smap_save() asm
On Tue, Sep 15, 2020 at 01:55:58PM -0700, Andy Lutomirski wrote: > The old smap_save() code was: > > pushf > pop %0 > > with %0 defined by an "=rm" constraint. This is fine if the > compiler picked the register option, but it was incorrect with an > %rsp-relative memory operand. With some intentional abuse, I can > get both gcc and clang to generate code along these lines: > > pushfq > popq 0x8(%rsp) > mov 0x8(%rsp),%rax > > which is incorrect and will not work as intended. We need another constraint :-) > Fix it by removing the memory option. This issue is exacerbated by > a clang optimization bug: > > https://bugs.llvm.org/show_bug.cgi?id=47530 > > Fixes: e74deb11931f ("x86/uaccess: Introduce user_access_{save,restore}()") > Cc: sta...@vger.kernel.org > Reported-by: Bill Wendling # I think > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/asm/smap.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h > index 8b58d6975d5d..be6d675ae3ac 100644 > --- a/arch/x86/include/asm/smap.h > +++ b/arch/x86/include/asm/smap.h > @@ -61,7 +61,7 @@ static __always_inline unsigned long smap_save(void) > ALTERNATIVE("jmp 1f", "", X86_FEATURE_SMAP) > "pushf; pop %0; " __ASM_CLAC "\n\t" > "1:" > - : "=rm" (flags) : : "memory", "cc"); > + : "=r" (flags) : : "memory", "cc"); native_save_fl() has the exact same code; you'll need to fix both.
Re: [PATCH v2 0/5] seqlock: Introduce PREEMPT_RT support
On Tue, Sep 15, 2020 at 08:48:17PM +0800, Boqun Feng wrote:

> I think this happened because seqcount_##lockname##_init() is defined as
> a function rather than a macro, so when the seqcount_init() gets expanded in

Bah! I hate all this :/

I suspect the below, while more verbose than I'd like, is the best
option.

---
 include/linux/seqlock.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index f73c7eb68f27..76e44e6c0100 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -173,6 +173,19 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
  * @lock: Pointer to the associated lock
  */

+#define seqcount_LOCKNAME_init(s, _lock, lockname)			\
+	do {								\
+		seqcount_##lockname##_t *____s = (s);			\
+		seqcount_init(&____s->seqcount);			\
+		__SEQ_LOCK(____s->lock = (_lock));			\
+	} while (0)
+
+#define seqcount_raw_spinlock_init(s, lock)	seqcount_LOCKNAME_init(s, lock, raw_spinlock)
+#define seqcount_spinlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, spinlock)
+#define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock);
+#define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex);
+#define seqcount_ww_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, ww_mutex);
+
 /*
  * SEQCOUNT_LOCKNAME()	- Instantiate seqcount_LOCKNAME_t and helpers
  * seqprop_LOCKNAME_*()	- Property accessors for seqcount_LOCKNAME_t
@@ -190,13 +203,6 @@ typedef struct seqcount_##lockname {				\
 	__SEQ_LOCK(locktype	*lock);					\
 } seqcount_##lockname##_t;						\
 									\
-static __always_inline void						\
-seqcount_##lockname##_init(seqcount_##lockname##_t *s, locktype *lock)	\
-{									\
-	seqcount_init(&s->seqcount);					\
-	__SEQ_LOCK(s->lock = lock);					\
-}									\
-									\
 static __always_inline seqcount_t *					\
 __seqprop_##lockname##_ptr(seqcount_##lockname##_t *s)			\
 {									\
@@ -284,8 +290,8 @@ SEQCOUNT_LOCKNAME(ww_mutex,	struct ww_mutex, true,	&s->lock->base, ww_mu
 		__SEQ_LOCK(.lock	= (assoc_lock))			\
 	}

-#define SEQCNT_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_RAW_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
+#define SEQCNT_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_RWLOCK_ZERO(name, lock)		SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_MUTEX_ZERO(name, lock)		SEQCOUNT_LOCKNAME_ZERO(name, lock)
 #define SEQCNT_WW_MUTEX_ZERO(name, lock)	SEQCOUNT_LOCKNAME_ZERO(name, lock)
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:

> Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.

How's this?

---
Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
From: Hou Tao
Date: Tue, 15 Sep 2020 22:07:50 +0800

From: Hou Tao

The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
that percpu-rwsem is a blocking primitive, should be just fine.

However, file_end_write() is used from IRQ context and will cause
load-store issues.

Fix it by using the IRQ-safe this_cpu_*() operations on read_count.

This will generate more expensive code on a number of platforms, which
might cause a performance regression for some of the other percpu-rwsem
users. If any such is reported, we can consider alternative solutions.

Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
Signed-off-by: Hou Tao
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
---
 include/linux/percpu-rwsem.h  |    8 ++++----
 kernel/locking/percpu-rwsem.c |    4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -60,7 +60,7 @@ static inline void percpu_down_read(stru
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -79,7 +79,7 @@ static inline bool percpu_down_read_tryl
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -103,7 +103,7 @@ static inline void percpu_up_read(struct
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,7 +115,7 @@ static inline void percpu_up_read(struct
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	__this_cpu_inc(*sem->read_count);
+	this_cpu_inc(*sem->read_count);

 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(s
 	if (likely(!atomic_read_acquire(&sem->block)))
 		return true;

-	__this_cpu_dec(*sem->read_count);
+	this_cpu_dec(*sem->read_count);

 	/* Prod writer to re-evaluate readers_active_check() */
 	rcuwait_wake_up(&sem->writer);
Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
>
> This commit causes the kernel to oops and reboot when injecting an SLB
> multihit which causes an MCE.
>
> Before this commit an SLB multihit was corrected by the kernel and the
> system continued to operate normally.
>
> cc: sta...@vger.kernel.org
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> Signed-off-by: Michal Suchanek

Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()"),
nmi_enter() has supported nesting natively.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:11:23PM +0100, Will Deacon wrote:
> On Tue, Sep 15, 2020 at 06:03:44PM +0200, pet...@infradead.org wrote:
> > On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> >
> > > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> >
> > How's this?
> >
> > ---
> > Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> > From: Hou Tao
> > Date: Tue, 15 Sep 2020 22:07:50 +0800
> >
> > From: Hou Tao
> >
> > The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> > that percpu-rwsem is a blocking primitive, should be just fine.
> >
> > However, file_end_write() is used from IRQ context and will cause
> > load-store issues.
>
> ... on architectures where the per-cpu accessors are not atomic.

That's not entirely accurate. On x86, for example, the per-cpu ops are
not atomic, but they are not susceptible to this problem because they
are a single instruction from the point of view of interrupts -- either
they wholly happen or they don't.

So I'd reformulate it like:

  "... on architectures where the per-cpu accessors are not natively
  irq-safe"

?
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 05:31:14PM +0200, Oleg Nesterov wrote:
> > So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
> > IRQ context is totally out of spec. That said, we've (grudgingly)
> > accommodated them before.
>
> Yes, I didn't expect percpu_up_ can be called from IRQ :/

Yeah, me neither. That's well out of spec for a blocking primitive in
general.

> > This seems to be a fairly long standing issue, and certainly not unique
> > to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
> > should be similarly affected I think). The issue seems to stem from
> > Oleg's original rewrite:
> >
> >   a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers
> >   unnecessarily")
>
> Not really... I think it was 70fe2f48152e ("aio: fix freeze protection of aio
> writes").

Ah, that came later? Fair enough, I'll change the Fixes line.

> And iiuc io_uring does the same.

Indeed, I just went through a bunch of the file_end_write() callers.

> > and is certainly an understandable mistake.
> >
> > I'm torn on what to do, using this_cpu over __this_cpu is going to
> > adversely affect code-gen (and possibly performance) for all the
> > percpu-rwsem users that are not quite so 'creative'.
>
> Yes, but what else can we do?

Well, I just talked about it with Will; there's a bunch of things we
could do, but they're all quite ugly. My leading alternative was adding:

  percpu_down_read_irqsafe() / percpu_up_read_irqsafe()

which use local_irq_save() instead of preempt_disable(). But blergh..

Will also argued that by going with this patch, we'll get an affected
workload when someone reports a performance regression, which I suppose
is a bonus.

Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count
On Tue, Sep 15, 2020 at 10:07:50PM +0800, Hou Tao wrote:
> Under aarch64, __this_cpu_inc() is neither IRQ-safe nor atomic, so
> when percpu_up_read() is invoked under IRQ-context (e.g. aio completion),
> and it interrupts the process on the same CPU which is invoking
> percpu_down_read(), the decrement on read_count may be lost and
> the final value of read_count on the CPU will be unexpected
> as shown below:
>
> Fix it by using the IRQ-safe helpers this_cpu_inc|dec() for
> operations on read_count.
>
> Another plausible fix is to state that percpu-rwsem can NOT be
> used under IRQ context and convert all users which may
> use it under IRQ context.

*groan*...

So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
IRQ context is totally out of spec. That said, we've (grudgingly)
accommodated them before.

This seems to be a fairly long standing issue, and certainly not unique
to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
should be similarly affected I think). The issue seems to stem from
Oleg's original rewrite:

  a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers
  unnecessarily")

and is certainly an understandable mistake.

I'm torn on what to do; using this_cpu over __this_cpu is going to
adversely affect code-gen (and possibly performance) for all the
percpu-rwsem users that are not quite so 'creative'.
Re: WARNING: suspicious RCU usage: race/events/tlb.h:57 suspicious rcu_dereference_check() usage!
On Mon, Sep 14, 2020 at 01:29:34PM -0400, Qian Cai wrote: > On Wed, 2020-09-09 at 10:08 +0530, Naresh Kamboju wrote: > > While booting x86_64 with Linux next 20200908 tag kernel this warning > > was noticed. > > This pretty much looks like the same issue in: > > https://lore.kernel.org/lkml/20200902035146.ga45...@roeck-us.net/ > > Can you revert the patchset to see if it is related? Don't bother.. I'll be sending patches soon.
Re: Static call dependency on libelf version?
On Tue, Sep 15, 2020 at 12:50:54AM -0700, Hugh Dickins wrote:
> This is just an FYI written from a position of ignorance: I may
> have got it wrong, and my build environment too piecemeal to matter
> to anyone else; but what I saw was weird enough to be worth mentioning,
> in case it saves someone some time.
>
> I usually build and test on mmotm weekly rather than linux-next daily.
> No problem with 5.9-rc3-mm1 from 2020-09-04, nor with 5.9-rc5, but
> (on two machines) 5.9-rc5-mm1 from 2020-09-13 could not link vmlinux:
>
>   AR      init/built-in.a
>   LD      vmlinux.o
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: warning: init/main.o has a corrupt section with a size (7472747368732e00) larger than the file size
> ld: init/built-in.a: member init/main.o in archive is not an object
> make[1]: *** [vmlinux] Error 1
> make: *** [__sub-make] Error 2
>
> On the third machine, a more recent installation, but using the same
> gcc and the same binutils, I could build the same config successfully.
> init/main.o was the same size on each (49216 bytes), but diff of hd
> of the good against the bad showed:
>
> 2702,2709c2702,2709
> < 00bfc0 01db 0001 0003  ><
> < 00bfd0 b316  ><
> < 00bfe0 0018  ><
> < 00bff0 0001 0008  ><
> < 00c000 01ee 0004 0040  >@...<
> < 00c010 b330  >0...<
> < 00c020 0090 002d 0030  >-...0...<
> < 00c030 0008 0018  ><
> ---
> > 00bfc0 01f1  ><
> > 00bfd0 79732e00 6261746d 74732e00 62617472  >..symtab..strtab<
> > 00bfe0 68732e00 74727473 2e006261 616c6572  >..shstrtab..rela<
> > 00bff0 7865742e 722e0074 2e616c65 61746164  >.text..rela.data<
> > 00c000 73622e00 722e0073 5f616c65 6172745f  >..bss..rela__tra<
> > 00c010 6f706563 73746e69 7274705f 722e0073  >cepoints_ptrs..r<
> > 00c020 2e616c65 74617473 635f6369 2e6c6c61  >ela.static_call.<
> > 00c030 74786574 65722e00 692e616c 2e74696e  >text..rela.init.<
>
> and 217 other .os in the build tree also "corrupted".
>
> CONFIG_HAVE_STATIC_CALL=y
> CONFIG_HAVE_STATIC_CALL_INLINE=y
>
> stand out as new in the .config for 5.9-rc5-mm1, and references
> to objtool in static_call.h and static_call_types.h took me to
> tools/objtool/Makefile, with its use of libelf.
>
> I've copied over files of the newer libelf (0.168) to the failing
> machines, which are now building the 5.9-rc5-mm1 vmlinux correctly.
>
> It looks as if CONFIG_HAVE_STATIC_CALL=y depends on a newer libelf
> than I had before (0.155), and should either insist on a minimum
> version, or else be adjusted to work with older versions.

Hurmph, I have no idea how this happened; clearly none of my machines
have this older libelf :/ (the machines I use most seem to be on 0.180.)

I'm also not sure what static_call is doing differently from, say, ORC
data generation. Both create and fill sections in similar ways.

Mark, do you have any idea?
Re: [PATCH] x86/unwind/fp: Fix FP unwinding in ret_from_fork
On Mon, Sep 14, 2020 at 12:04:22PM -0500, Josh Poimboeuf wrote: > There have been some reports of "bad bp value" warnings printed by the > frame pointer unwinder: > > WARNING: kernel stack regs at 5bac7112 in sh:1014 has bad 'bp' > value > > This warning happens when unwinding from an interrupt in > ret_from_fork(). If entry code gets interrupted, the state of the frame > pointer (rbp) may be undefined, which can confuse the unwinder, > resulting in warnings like the above. > > There's an in_entry_code() check which normally silences such warnings > for entry code. But in this case, ret_from_fork() is getting > interrupted. It recently got moved out of .entry.text, so the > in_entry_code() check no longer works. > > We could move it back into .entry.text, but that would break the noinstr > validation because of the call to schedule_tail(). > > Instead, initialize each new task's RBP to point to the task's entry > regs via an encoded frame pointer. That will allow the unwinder to > reach the end of the stack gracefully. > > Fixes: b9f6976bfb94 ("x86/entry/64: Move non entry code into .text section") > Reported-by: Naresh Kamboju > Reported-by: Logan Gunthorpe > Signed-off-by: Josh Poimboeuf Acked-by: Peter Zijlstra (Intel)
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Mon, Sep 14, 2020 at 12:28:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > >   struct {
> > >           struct perf_event_header header;
> > >
> > >           u32     pid, tid;
> > >           u64     addr;
> > >           u64     len;
> > >           u64     pgoff;
> > >           u32     maj;
> > >           u32     min;
> > >           u64     ino;
> > >           u64     ino_generation;
> > >           u32     prot, flags;
> > >           u32     reserved;
>
> What for this reserved? It's all nicely aligned already, u64 followed by
> two u32 (prot, flags).

I suspect it is so that sizeof(reserved + buildid) is a multiple of 8.
But yes, that's a wee bit daft, since the next field is a variable
length character array.

> > >           u8      buildid[20];
>
> > > Do we need maj, min, ino, ino_generation for mmap3 event?
> > > I think they are to compare binaries, then we can do it with
> > > build-id (and I think it'd be better)..
>
> Humm, I thought MMAP2 would be a superset of MMAP and MMAP3 would be a
> superset of MMAP2.

Well, the 'funny' thing is that if you do use buildid, then
{maj,min,ino,ino_generation} are indeed superfluous, but combined they
are also large enough to contain a buildid.

> If we want to ditch useless stuff, then throw away pid, tid too, as we
> can select those via sample_type.

Correct. So something like:

	struct {
		struct perf_event_header header;

		u64	addr;
		u64	len;
		u64	pgoff;
		union {
			struct {
				u32	maj;
				u32	min;
				u64	ino;
				u64	ino_generation;
			};
			u8	buildid[20];
		};
		u32	prot, flags;
		char	filename[];
		struct sample_id sample_id;
	};

Using one of the MISC bits to resolve the union.

Might actually bring benefit to everyone. Us normal people get to have a
smaller MMAP record, while the buildid folks can have it too.

Even more extreme would be using 2 MISC bits and allowing the union to
be 0 sized for anon.

That said; I have the nagging feeling there were unresolved issues with
mmap2, but I can't seem to find any relevant emails on it :/ My
google-fu is weak today.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Mon, Sep 14, 2020 at 12:27:35PM +0100, Qais Yousef wrote:
> What does PREEMPT_RT do to deal with softirq delays?

Makes the lot preemptible; you found the patch below.

> I have tried playing with enabling threadirqs, which AFAIU should make
> softirqs preemptible, right?

Not yet,..

> I realize this patch is still missing from mainline at least:
>
>   https://gitlab.com/kalilinux/packages/linux/blob/a17bad0db9da44cd73f594794a58cc5646393b13/debian/patches-rt/softirq-Add-preemptible-softirq.patch
>
> Would this be a heavy handed approach to make available for non PREEMPT_RT
> kernels?

Not sure, I suspect it relies on migrate_disable(), which is
preempt_disable() on !RT, and then we're back to square one.

> I only worry about potential NET_RX throughput issues. Which by the way is
> protected with preempt_disable currently in mainline. See netif_rx_ni().

So preempt_disable() isn't necessarily a problem, you just want it to
terminate soonish after need_resched() becomes true.

Also, I'm having a wee problem getting from net_rx_action() to
netif_rx_ni().

> I am guessing here, but I suspect this NET_RX softirq is one source of big
> delays when network activity is high.

Well, one approach is to more aggressively limit how long softirq
processing can run. Current measures are very soft in that regard.
Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote: > Vincent Guittot (4): > sched/fair: relax constraint on task's load during load balance > sched/fair: reduce minimal imbalance threshold > sched/fair: minimize concurrent LBs between domain level > sched/fair: reduce busy load balance interval I see nothing objectionable there, a little more testing can't hurt, but I'm tempted to apply them. Phil, Mel, any chance you can run them through your respective setups?
Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
> Reading and modifying current->mm and current->active_mm and switching
> mm should be done with irqs off, to prevent races seeing an intermediate
> state.
>
> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
> invalidate"). At exec-time when the new mm is activated, the old one
> should usually be single-threaded and no longer used, unless something
> else is holding an mm_users reference (which may be possible).
>
> Absent other mm_users, there is also a race with preemption and lazy tlb
> switching. Consider the kernel_execve case where the current thread is
> using a lazy tlb active mm:
>
>   call_usermodehelper()
>     kernel_execve()
>       old_mm = current->mm;
>       active_mm = current->active_mm;
>       *** preempt *** -------------------->  schedule()
>                                                prev->active_mm = NULL;
>                                                mmdrop(prev active_mm);
>                                              ...
>       <<< resumed                         <  schedule()
>       current->mm = mm;
>       current->active_mm = mm;
>       if (!old_mm)
>           mmdrop(active_mm);
>
> If we switch back to the kernel thread from a different mm, there is a
> double free of the old active_mm, and a missing free of the new one.
>
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
>
> So as a first step, disable interrupts across the mm/active_mm updates
> to close the lazy tlb preempt race, and provide an arch option to
> extend that to activate_mm which allows architectures doing IPI based
> TLB shootdowns to close the second race.
>
> This is a bit ugly, but in the interest of fixing the bug and backporting
> before all architectures are converted this is a compromise.
>
> Signed-off-by: Nicholas Piggin

Acked-by: Peter Zijlstra (Intel)

I'm thinking we want this selected on x86 as well. Andy?

> ---
>  arch/Kconfig | 7 +++++++
>  fs/exec.c    | 17 +++++++++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..94821e3f94d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
>  	bool
>  	depends on MMU_GATHER_TABLE_FREE
>
> +config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> +	bool
> +	help
> +	  Temporary select until all architectures can be converted to have
> +	  irqs disabled over activate_mm. Architectures that do IPI based TLB
> +	  shootdowns should enable this.
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a91003e28eaa..d4fb18baf1fb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
>  	}
>
>  	task_lock(tsk);
> -	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> -	tsk->mm = mm;
> +
> +	local_irq_disable();
> +	active_mm = tsk->active_mm;
>  	tsk->active_mm = mm;
> +	tsk->mm = mm;
> +	/*
> +	 * This prevents preemption while active_mm is being loaded and
> +	 * it and mm are being updated, which could cause problems for
> +	 * lazy tlb mm refcounting when these are updated by context
> +	 * switches. Not all architectures can handle irqs off over
> +	 * activate_mm yet.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	activate_mm(active_mm, mm);
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	tsk->mm->vmacache_seqnum = 0;
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
> --
> 2.23.0
>
Re: [PATCH 1/6] x86: events: Avoid TIF_IA32 when checking 64bit mode
On Sat, Sep 12, 2020 at 03:05:48AM -0400, Gabriel Krisman Bertazi wrote: > In preparation to remove TIF_IA32, stop using it in perf events code. > > Tested by running perf on 32-bit, 64-bit and x32 applications. > > Suggested-by: Andy Lutomirski > Signed-off-by: Gabriel Krisman Bertazi Acked-by: Peter Zijlstra (Intel)
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Sun, Sep 13, 2020 at 11:02:49PM +0200, Jiri Olsa wrote:
> Add new version of mmap event. The MMAP3 record is an
> augmented version of MMAP2, it adds build id value to
> identify the exact binary object behind memory map:
>
>   struct {
>           struct perf_event_header header;
>
>           u32     pid, tid;
>           u64     addr;
>           u64     len;
>           u64     pgoff;
>           u32     maj;
>           u32     min;
>           u64     ino;
>           u64     ino_generation;
>           u32     prot, flags;
>           u32     reserved;
>           u8      buildid[20];
>           char    filename[];
>           struct sample_id sample_id;
>   };

So weren't there still open problems with mmap2 that also needed
addressing? I seem to have forgotten :/
Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event
On Sun, Sep 13, 2020 at 11:41:00PM -0700, Stephane Eranian wrote:
> On Sun, Sep 13, 2020 at 2:03 PM Jiri Olsa wrote:
>
> What happens if I set mmap3 and mmap2?
>
> I think using mmap3 for every mmap may be overkill, as you add a useless
> 20 bytes to an mmap record.
>
> I am not sure if your code handles the case where mmap3 is not needed
> because there is no buildid, e.g., anonymous memory.
>
> It seems to me you've written the patch in such a way that if the user
> tool supports mmap3, then it supersedes mmap2, and thus you need all the
> fields of mmap2. But it could be more interesting to return either MMAP2
> or MMAP3 depending on tool support and type of mmap; that would
> certainly save 20 bytes on any anon mmap. But maybe that logic is
> already in your patch and I missed it.

That, and what if you don't want any of that buildid nonsense at all? I
always kill that because it makes perf pointlessly slow and has
absolutely no upsides for me.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Fri, Sep 11, 2020 at 05:46:45PM +0100, Qais Yousef wrote:
> On 09/09/20 17:09, qianjun.ker...@gmail.com wrote:
> > From: jun qian
> >
> > When get the pending softirqs, it need to process all the pending
> > softirqs in the while loop. If the processing time of each pending
> > softirq is need more than 2 msec in this loop, or one of the softirq
> > will running a long time, according to the original code logic, it
> > will process all the pending softirqs without wakeuping ksoftirqd,
> > which will cause a relatively large scheduling delay on the
> > corresponding CPU, which we do not wish to see. The patch will check
> > the total time to process pending softirq, if the time exceeds 2 ms
> > we need to wakeup the ksofirqd to aviod large sched delay.
> >
> > Signed-off-by: jun qian
>
> In Android there's a patch that tries to avoid scheduling an RT task on
> a cpu that is running softirqs. I wonder if this patch helps with this
> case.
>
> https://android.googlesource.com/kernel/msm/+/5c3f54c34acf4d9ed01530288d4a98acff815d79%5E%21/#F0
>
> John, Wei, is this something of interest to you?

Urgh.. that's pretty gross. I think the sane approach is indeed getting
softirqs to react to need_resched() better.
Re: [PATCH 1/2] sched: Fix balance_callback()
On Fri, Sep 11, 2020 at 01:17:02PM +0100, Valentin Schneider wrote:
> On 11/09/20 09:17, Peter Zijlstra wrote:
> > The intent of balance_callback() has always been to delay executing
> > balancing operations until the end of the current rq->lock section.
> > This is because balance operations must often drop rq->lock, and that
> > isn't safe in general.
> >
> > However, as noted by Scott, there were a few holes in that scheme;
> > balance_callback() was called after rq->lock was dropped, which means
> > another CPU can interleave and touch the callback list.
>
> So that can be, say, a __schedule() tail racing with some setprio;
> what's the worst that can (currently) happen here? Something like, say,
> two consecutive enqueuings of push_rt_tasks() to the callback list?

Yeah, but that isn't in fact the case I worry most about. What can
happen (and what I've spotted once before) is that someone attempts to
enqueue a balance_callback from a rq->lock region that doesn't handle
the calls.

Currently that 'works'; that is, it will get run _eventually_. But
ideally we'd want that to not work and issue a WARN. We want the
callbacks to be timely.

So basically all of these machinations are in order to add the WARN :-)
Re: [PATCH 2/2] sched/hotplug: Ensure only per-cpu kthreads run during hotplug
On Fri, Sep 11, 2020 at 01:17:45PM +0100, Valentin Schneider wrote:
> > @@ -6968,6 +7064,8 @@ int sched_cpu_deactivate(unsigned int cp
> >	 */
> >	synchronize_rcu();
> >
> > +	balance_push_set(cpu, true);
> > +
>
> IIUC this is going to make every subsequent finish_lock_switch()
> migrate the switched-to task if it isn't a pcpu kthread. So this is
> going to lead to a dance of
>
>   switch_to() -> switch_to() -> switch_to() -> switch_to() ...
>
> Wouldn't it be better to batch all those in a migrate_tasks() sibling
> that skips pcpu kthreads?

That's 'difficult'; this is hotplug, performance is not a
consideration. Basically we don't have an iterator for the runqueues,
so finding these tasks is hard.
Re: [PATCH] ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
On Fri, Sep 11, 2020 at 03:55:22PM +0300, Adrian Hunter wrote:
> On 11/09/20 2:41 pm, pet...@infradead.org wrote:
> > On Tue, Sep 01, 2020 at 12:16:17PM +0300, Adrian Hunter wrote:
> >> Add synchronize_rcu() after list_del_rcu() in
> >> ftrace_remove_trampoline_from_kallsyms() to protect readers of
> >> ftrace_ops_trampoline_list (in ftrace_get_trampoline_kallsym)
> >> which is used when kallsyms is read.
> >>
> >> Fixes: fc0ea795f53c8d ("ftrace: Add symbols for ftrace trampolines")
> >> Signed-off-by: Adrian Hunter
> >> ---
> >>  kernel/trace/ftrace.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> >> index 275441254bb5..4e64367c9774 100644
> >> --- a/kernel/trace/ftrace.c
> >> +++ b/kernel/trace/ftrace.c
> >> @@ -2782,6 +2782,7 @@ static void ftrace_remove_trampoline_from_kallsyms(struct ftrace_ops *ops)
> >>  {
> >>  	lockdep_assert_held(&ftrace_lock);
> >>  	list_del_rcu(&ops->list);
> >> +	synchronize_rcu();
> >>  }
> >
> > Hurmph, we've just done a ton of that:
> >
> >   ftrace_shutdown()
> >     synchronize_rcu_tasks_rude()
> >     ftrace_trampoline_free()
> >       ftrace_remove_trampoline_from_kallsyms()
> >
> > So would it not be better to move that call before the existing
> > synchronize_rcu_tasks stuff rather than adding another synchronize_rcu()
> > call?
>
> Doesn't that mean removing the symbol while the trampoline is potentially
> still in use?

Hurm.. potentially, yes. OK, let's do this first.
Re: [PATCH V6 1/1] Softirq:avoid large sched delay from the pending softirqs
On Wed, Sep 09, 2020 at 05:09:31PM +0800, qianjun.ker...@gmail.com wrote:
> From: jun qian
>
> When get the pending softirqs, it need to process all the pending
> softirqs in the while loop. If the processing time of each pending
> softirq is need more than 2 msec in this loop, or one of the softirq
> will running a long time, according to the original code logic, it
> will process all the pending softirqs without wakeuping ksoftirqd,
> which will cause a relatively large scheduling delay on the
> corresponding CPU, which we do not wish to see. The patch will check
> the total time to process pending softirq, if the time exceeds 2 ms
> we need to wakeup the ksofirqd to aviod large sched delay.

But what is all that unreadable gibberish with pending_new_{flag,bit} ?

Random comments below..

> +#define MAX_SOFTIRQ_TIME_NS 2000000

2*NSEC_PER_MSEC

> +DEFINE_PER_CPU(__u32, pending_new_flag);
> +DEFINE_PER_CPU(__u32, pending_next_bit);

__u32 is for userspace ABI, this is not it, use u32

> +#define SOFTIRQ_PENDING_MASK ((1UL << NR_SOFTIRQS) - 1)
> +
>  asmlinkage __visible void __softirq_entry __do_softirq(void)
>  {
> -	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> +	u64 end = sched_clock() + MAX_SOFTIRQ_TIME_NS;
>  	unsigned long old_flags = current->flags;
>  	int max_restart = MAX_SOFTIRQ_RESTART;
>  	struct softirq_action *h;
>  	bool in_hardirq;
> -	__u32 pending;
> -	int softirq_bit;
> +	__u32 pending, pending_left, pending_new;
> +	int softirq_bit, next_bit;
> +	unsigned long flags;
>
>  	/*
>  	 * Mask out PF_MEMALLOC as the current task context is borrowed for the
> @@ -277,10 +282,33 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
>  	h = softirq_vec;
>
> -	while ((softirq_bit = ffs(pending))) {
> -		unsigned int vec_nr;
> +	next_bit = per_cpu(pending_next_bit, smp_processor_id());
> +	per_cpu(pending_new_flag, smp_processor_id()) = 0;

__this_cpu_read() / __this_cpu_write()

> +
> +	pending_left = pending &
> +		(SOFTIRQ_PENDING_MASK << next_bit);
> +	pending_new = pending &
> +		(SOFTIRQ_PENDING_MASK >> (NR_SOFTIRQS - next_bit));

The second mask is the inverse of the first.

> +	/*
> +	 * In order to be fair, we shold process the pengding bits by the
> +	 * last processing order.
> +	 */
> +	while ((softirq_bit = ffs(pending_left)) ||
> +		(softirq_bit = ffs(pending_new))) {
>  		int prev_count;
> +		unsigned int vec_nr = 0;
>
> +		/*
> +		 * when the left pengding bits have been handled, we should
> +		 * to reset the h to softirq_vec.
> +		 */
> +		if (!ffs(pending_left)) {
> +			if (per_cpu(pending_new_flag, smp_processor_id()) == 0) {
> +				h = softirq_vec;
> +				per_cpu(pending_new_flag, smp_processor_id()) = 1;
> +			}
> +		}
>  		h += softirq_bit - 1;
>
>  		vec_nr = h - softirq_vec;
> @@ -298,17 +326,44 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>  			preempt_count_set(prev_count);
>  		}
>  		h++;
> -		pending >>= softirq_bit;
> +
> +		if (ffs(pending_left))

This is the _third_ ffs(pending_left), those things are _expensive_ (on
some archs, see include/asm-generic/bitops/__ffs.h).

> +			pending_left >>= softirq_bit;
> +		else
> +			pending_new >>= softirq_bit;
> +
> +		/*
> +		 * the softirq's action has been run too much time,
> +		 * so it may need to wakeup the ksoftirqd
> +		 */
> +		if (need_resched() && sched_clock() > end) {
> +			/*
> +			 * Ensure that the remaining pending bits will be
> +			 * handled.
> +			 */
> +			local_irq_save(flags);
> +			if (ffs(pending_left))

*fourth*...

> +				or_softirq_pending((pending_left << (vec_nr + 1)) |
> +					pending_new);
> +			else
> +				or_softirq_pending(pending_new << (vec_nr + 1));
> +			local_irq_restore(flags);
> +			per_cpu(pending_next_bit, smp_processor_id()) = vec_nr + 1;
> +			break;
> +		}
>  	}
>
> +	/* reset the pending_next_bit */
> +	per_cpu(pending_next_bit, smp_processor_id()) = 0;
> +
>  	if (__this_cpu_read(ksoftirqd) == current)
>  		rcu_softirq_qs();
>  	local_irq_disable();
>
>  	pending = local_softirq_pending();
>  	if (pending) {
> -		if (time_before(jiffies, end) && !need_resched() &&
Re: [PATCH] ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
On Tue, Sep 01, 2020 at 12:16:17PM +0300, Adrian Hunter wrote: > Add synchronize_rcu() after list_del_rcu() in > ftrace_remove_trampoline_from_kallsyms() to protect readers of > ftrace_ops_trampoline_list (in ftrace_get_trampoline_kallsym) > which is used when kallsyms is read. > > Fixes: fc0ea795f53c8d ("ftrace: Add symbols for ftrace trampolines") > Signed-off-by: Adrian Hunter > --- > kernel/trace/ftrace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 275441254bb5..4e64367c9774 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -2782,6 +2782,7 @@ static void > ftrace_remove_trampoline_from_kallsyms(struct ftrace_ops *ops) > { > lockdep_assert_held(&ftrace_lock); > list_del_rcu(&ops->list); > + synchronize_rcu(); > } Hurmph, we've just done a ton of that: ftrace_shutdown() synchronize_rcu_tasks_rude() ftrace_trampoline_free() ftrace_remove_trampoline_from_kallsyms() So would it not be better to move that call before the existing synchronize_rcu_tasks stuff rather than adding another synchronize_rcu() call?
Re: [PATCH 1/4] cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
On Wed, Sep 02, 2020 at 12:54:41PM +0530, Viresh Kumar wrote: > + atomic_t reset_pending; > + atomic_set(&stats->reset_pending, 0); > + if (atomic_read(&stats->reset_pending)) > + bool pending = atomic_read(&stats->reset_pending); > + atomic_set(&stats->reset_pending, 1); > + bool pending = atomic_read(&stats->reset_pending); > + if (atomic_read(&stats->reset_pending)) What do you think atomic_t is doing for you?
Re: [PATCH v6 6/9] kernel: entry: Support Syscall User Dispatch for common syscall entry
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > Syscall User Dispatch (SUD) must take precedence over seccomp, since the > use case is emulation (it can be invoked with a different ABI) such that > seccomp filtering by syscall number doesn't make sense in the first > place. In addition, either the syscall is dispatched back to userspace, > in which case there is no resource for seccomp to protect, or the > syscall will be executed, and seccomp will execute next. > > Regarding ptrace, I experimented with before and after, and while the > same ABI argument applies, I felt it was easier to debug if I let ptrace > happen for syscalls that are dispatched back to userspace. In addition, > doing it after ptrace makes the code in syscall_exit_work slightly > simpler, since it doesn't require special handling for this feature. I think I'm with Andy that this should be before ptrace(). ptrace() users will attempt to interpret things like they're regular syscalls, and that's definitely not the case.
Re: [PATCH v6 5/9] kernel: Implement selective syscall userspace redirection
On Fri, Sep 04, 2020 at 04:31:43PM -0400, Gabriel Krisman Bertazi wrote: > +struct syscall_user_dispatch { > + char __user *selector; > + unsigned long dispatcher_start; > + unsigned long dispatcher_end; > +}; > +int do_syscall_user_dispatch(struct pt_regs *regs) > +{ > + struct syscall_user_dispatch *sd = ¤t->syscall_dispatch; > + unsigned long ip = instruction_pointer(regs); > + char state; > + > + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) > + return 0; If you use {offset,size}, instead of {start,end}, you can write the above like: if (ip - sd->dispatcher_offset < sd->dispatcher_size) return 0; which is just a single branch.
Re: [PATCH v6 2/9] kernel: entry: Support TIF_SYSCALL_INTERCEPT on common entry code
On Fri, Sep 04, 2020 at 04:31:40PM -0400, Gabriel Krisman Bertazi wrote: > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h > index efebbffcd5cc..72ce9ca860c6 100644 > --- a/include/linux/entry-common.h > +++ b/include/linux/entry-common.h > @@ -21,10 +21,6 @@ > # define _TIF_SYSCALL_TRACEPOINT (0) > #endif > > -#ifndef _TIF_SECCOMP > -# define _TIF_SECCOMP(0) > -#endif > - > #ifndef _TIF_SYSCALL_AUDIT > # define _TIF_SYSCALL_AUDIT (0) > #endif Why doesn't this add: #ifndef _TIF_SYSCALL_INTERCEPT #define _TIF_SYSCALL_INTERCEPT (0) #endif ?
Re: [PATCH v6 1/9] kernel: Support TIF_SYSCALL_INTERCEPT flag
On Fri, Sep 04, 2020 at 04:31:39PM -0400, Gabriel Krisman Bertazi wrote: > +static inline void __set_tsk_syscall_intercept(struct task_struct *tsk, > +unsigned int type) > +{ > + tsk->syscall_intercept |= type; > + > + if (tsk->syscall_intercept) > + set_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT); > +} Did the above want to be: unsigned int old = tsk->syscall_intercept; tsk->syscall_intercept |= type; if (!old) set_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT) ? > +static inline void __clear_tsk_syscall_intercept(struct task_struct *tsk, > + unsigned int type) > +{ > + tsk->syscall_intercept &= ~type; > + > + if (tsk->syscall_intercept == 0) > + clear_tsk_thread_flag(tsk, TIF_SYSCALL_INTERCEPT); > +}
Re: [PATCH 1/1] kernel/sched:use the enum code replace of the int variable
On Mon, Sep 07, 2020 at 09:05:02PM +0800, qianjun.ker...@gmail.com wrote: > From: jun qian > > It is hard to understand what the meaning of the value from > the return value of wakeup_preempt_entity, so I fix it. > @@ -6822,9 +6828,9 @@ static unsigned long wakeup_gran(struct sched_entity > *se) > * g > * |<--->|c > * > - * w(c, s1) = -1 > - * w(c, s2) = 0 > - * w(c, s3) = 1 > + * w(c, s1) = NO_NEED_PREEMPT > + * w(c, s2) = MAY_NEED_PREEMPT > + * w(c, s3) = NEED_PREEMPT > * > */ Yeah, I don't think so. The function is a simple C style compare, where negative is less, 0 is equal and positive is more than.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > So, I suggest pXX_offset_unlocked() Urgh, no. Elsewhere in gup _unlocked() means it will take the lock itself (get_user_pages_unlocked()) -- although often it seems to mean the lock is already held (git grep _unlocked and marvel). What we want is _lockless().
kcompactd hotplug fail
Hi, While playing with hotplug, I ran into the below: [ 2305.676384] [ cut here ] [ 2305.681543] WARNING: CPU: 1 PID: 15 at kernel/sched/core.c:1924 __set_cpus_allowed_ptr+0x1bd/0x230 [ 2305.691540] Modules linked in: kvm_intel kvm irqbypass rapl intel_cstate intel_uncore [ 2305.700284] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: GW 5.9.0-rc1-00126-g560d2f906d7e-dirty #392 [ 2305.711349] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013 [ 2305.722803] RIP: 0010:__set_cpus_allowed_ptr+0x1bd/0x230 [ 2305.728732] Code: ba 00 02 00 00 48 c7 c6 20 78 9e 82 4c 89 ef e8 19 ec 5f 00 85 c0 0f 85 5e ff ff ff 83 bb 60 03 00 00 01 0f 84 51 ff ff ff 90 <0f> 0b 90 e9 48 ff ff ff 83 bd 10 0a 00 00 02 48 89 5c 24 10 44 89 [ 2305.749687] RSP: :c900033dbdd8 EFLAGS: 00010002 [ 2305.755518] RAX: RBX: 88842c33cbc0 RCX: 0200 [ 2305.763478] RDX: 0008 RSI: 829e7820 RDI: 83055720 [ 2305.771439] RBP: 88842f43b4c0 R08: 0009 R09: 83055720 [ 2305.779399] R10: 0008 R11: R12: ffea [ 2305.787360] R13: 83055720 R14: 000d R15: 00b6 [ 2305.795321] FS: () GS:88842f48() knlGS: [ 2305.804348] CS: 0010 DS: ES: CR0: 80050033 [ 2305.810760] CR2: CR3: 02810001 CR4: 001706e0 [ 2305.818720] Call Trace: [ 2305.821454] kcompactd_cpu_online+0xa1/0xb0 [ 2305.826119] ? __compaction_suitable+0xa0/0xa0 [ 2305.831079] cpuhp_invoke_callback+0x9a/0x360 [ 2305.835941] cpuhp_thread_fun+0x19d/0x220 [ 2305.840414] ? smpboot_thread_fn+0x1b4/0x280 [ 2305.845178] ? smpboot_thread_fn+0x26/0x280 [ 2305.849842] ? smpboot_register_percpu_thread+0xe0/0xe0 [ 2305.855672] smpboot_thread_fn+0x1d0/0x280 [ 2305.860243] kthread+0x153/0x170 [ 2305.863843] ? 
kthread_create_worker_on_cpu+0x70/0x70 [ 2305.869482] ret_from_fork+0x22/0x30 [ 2305.873474] irq event stamp: 236 [ 2305.877067] hardirqs last enabled at (235): [] _raw_spin_unlock_irqrestore+0x4c/0x60 [ 2305.887550] hardirqs last disabled at (236): [] __schedule+0xc0/0xb10 [ 2305.896482] softirqs last enabled at (0): [] copy_process+0x889/0x1d40 [ 2305.905611] softirqs last disabled at (0): [<>] 0x0 [ 2305.912602] ---[ end trace e7f6c2a95b741e6b ]--- Given: static int __init kcompactd_init(void) { ... ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "mm/compaction:online", kcompactd_cpu_online, NULL); and: CPUHP_AP_ONLINE_DYN, CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30, CPUHP_AP_X86_HPET_ONLINE, CPUHP_AP_X86_KVM_CLK_ONLINE, CPUHP_AP_ACTIVE, this is somewhat expected behaviour. It tries and set the compaction affinity to include the newly onlined CPU before it is marked active and that's a no-no. Ideally the kcompactd notifier is ran after AP_ACTIVE, not before.
Re: kcompactd hotplug fail
On Thu, Sep 10, 2020 at 04:10:06PM +0200, pet...@infradead.org wrote: > Hi, > > While playing with hotplug, I ran into the below: Ah, it could be I wrecked my kernel bad... :-( I'll let you know if I can reproduce this on a pristine kernel.
Re: kcompactd hotplug fail
On Thu, Sep 10, 2020 at 04:37:45PM +0200, pet...@infradead.org wrote: > On Thu, Sep 10, 2020 at 04:10:06PM +0200, pet...@infradead.org wrote: > > Hi, > > > > While playing with hotplug, I ran into the below: > > Ah, it could be I wrecked my kernel bad... :-( > > I'll let you know if I can reproduce this on a pristine kernel. Yeah, sorry, I'll go search for one of them brown-paper things to wear.
Re: [sparc64] kernel OOPS bisected from "lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state"
On Thu, Sep 10, 2020 at 02:43:13PM +0300, Anatoly Pugachev wrote: > Hello! > > The following git patch 044d0d6de9f50192f9697583504a382347ee95ca > (linux git master branch) introduced the following kernel OOPS upon > kernel boot on my sparc64 T5-2 ldom (VM): https://lkml.kernel.org/r/20200908154157.gv1362...@hirez.programming.kicks-ass.net
Re: [PATCH v2 3/7] arch/x86/amd/ibs: Fix re-arming IBS Fetch
On Thu, Sep 10, 2020 at 10:32:23AM +0200, pet...@infradead.org wrote: > > @@ -363,7 +363,14 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, > > struct perf_event *event, > > static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, > > struct hw_perf_event *hwc, u64 config) > > { > > - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); > > + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; > > + > > + /* On Fam17h, the periodic fetch counter is set when IbsFetchEn is > > changed from 0 to 1 */ > > + if (perf_ibs == &perf_ibs_fetch && boot_cpu_data.x86 >= 0x16 && > > boot_cpu_data.x86 <= 0x18) > > + wrmsrl(hwc->config_base, _config); > A better option would be to use hwc->flags, you're loading from that > line already, so it's guaranteed hot and then you only have a single > branch. Or stick it in perf_ibs near enable_mask, same difference. I fixed it for you. --- struct perf_ibs { struct pmu pmu; /* 0 296 */ /* --- cacheline 4 boundary (256 bytes) was 40 bytes ago --- */ unsigned int msr; /* 296 4 */ /* XXX 4 bytes hole, try to pack */ u64config_mask; /* 304 8 */ u64cnt_mask; /* 312 8 */ /* --- cacheline 5 boundary (320 bytes) --- */ u64enable_mask; /* 320 8 */ u64valid_mask; /* 328 8 */ u64max_period; /* 336 8 */ long unsigned int offset_mask[1]; /* 344 8 */ intoffset_max; /* 352 4 */ unsigned int fetch_count_reset_broken:1; /* 356: 0 4 */ /* XXX 31 bits hole, try to pack */ struct cpu_perf_ibs * pcpu; /* 360 8 */ struct attribute * * format_attrs; /* 368 8 */ struct attribute_group format_group; /* 37640 */ /* --- cacheline 6 boundary (384 bytes) was 32 bytes ago --- */ const struct attribute_group * attr_groups[2]; /* 41616 */ u64(*get_count)(u64);/* 432 8 */ /* size: 440, cachelines: 7, members: 15 */ /* sum members: 432, holes: 1, sum holes: 4 */ /* sum bitfield members: 1 bits, bit holes: 1, sum bit holes: 31 bits */ /* last cacheline: 56 bytes */ }; --- a/arch/x86/events/amd/ibs.c +++ 
b/arch/x86/events/amd/ibs.c @@ -89,6 +89,7 @@ struct perf_ibs { u64 max_period; unsigned long offset_mask[1]; int offset_max; + unsigned intfetch_count_reset_broken : 1; struct cpu_perf_ibs __percpu*pcpu; struct attribute**format_attrs; @@ -370,7 +371,13 @@ perf_ibs_event_update(struct perf_ibs *p static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, struct hw_perf_event *hwc, u64 config) { - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; + + if (perf_ibs->fetch_count_reset_broken) + wrmsrl(hwc->config_base, _config); + + _config |= perf_ibs->enable_mask; + wrmsrl(hwc->config_base, _config); } /* @@ -756,6 +763,13 @@ static __init void perf_event_ibs_init(v { struct attribute **attr = ibs_op_format_attrs; + /* +* Some chips fail to reset the fetch count when it is written; instead +* they need a 0-1 transition of IbsFetchEn. +*/ + if (boot_cpu_data.x86 >= 0x16 && boot_cpu_data.x86 <= 0x18) + perf_ibs_fetch.fetch_count_reset_broken = 1; + perf_ibs_pmu_init(&perf_ibs_fetch, "ibs_fetch"); if (ibs_caps & IBS_CAPS_OPCNT) {
Re: [PATCH v2 3/7] arch/x86/amd/ibs: Fix re-arming IBS Fetch
On Tue, Sep 08, 2020 at 04:47:36PM -0500, Kim Phillips wrote: > Stephane Eranian found a bug in that IBS' current Fetch counter was not > being reset when the driver would write the new value to clear it along > with the enable bit set, and found that adding an MSR write that would > first disable IBS Fetch would make IBS Fetch reset its current count. > > Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12, > 2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when > IbsFetchEn is changed from 0 to 1." > > Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch, > so the driver properly resets the internal counter to 0 and IBS > Fetch starts counting again. > > A family 15h machine tested does not have this problem, and the extra > wrmsr is also not needed on Family 19h, so only do the extra wrmsr on > families 16h through 18h. *groan*... > --- > arch/x86/events/amd/ibs.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c > index 26c36357c4c9..3eb9a55e998c 100644 > --- a/arch/x86/events/amd/ibs.c > +++ b/arch/x86/events/amd/ibs.c > @@ -363,7 +363,14 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, struct > perf_event *event, > static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs, >struct hw_perf_event *hwc, u64 config) > { > - wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask); > + u64 _config = (hwc->config | config) & ~perf_ibs->enable_mask; > + > + /* On Fam17h, the periodic fetch counter is set when IbsFetchEn is > changed from 0 to 1 */ > + if (perf_ibs == &perf_ibs_fetch && boot_cpu_data.x86 >= 0x16 && > boot_cpu_data.x86 <= 0x18) > + wrmsrl(hwc->config_base, _config); That's an anti-patttern (and for some reason this is the second time in a week I've seen it). This is a fairly hot path and you're adding a whole bunch of loads and branches. 
Granted, in case you need the extra wrmsr that's probably noise, but supposedly you're going to be fixing this in hardware eventually, and you'll be getting rid of the second wrmsr again. But then you're stuck with the loads and branches. A better option would be to use hwc->flags, you're loading from that line already, so it's guaranteed hot and then you only have a single branch. Or stick it in perf_ibs near enable_mask, same difference. > + _config |= perf_ibs->enable_mask; > + wrmsrl(hwc->config_base, _config); > } > > /* > -- > 2.27.0 >
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Wed, Sep 02, 2020 at 06:48:30AM -0700, Guenter Roeck wrote: > On 9/2/20 2:12 AM, pet...@infradead.org wrote: > > On Wed, Sep 02, 2020 at 11:09:35AM +0200, pet...@infradead.org wrote: > >> On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote: > >>> [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 > >>> check_flags.part.39+0x280/0x2a0 > >>> [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) > >> > >>> [0.00] [<004cff18>] lock_acquire+0x218/0x4e0 > >>> [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40 > >>> [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60 > >> > >> Lol! yes, I can see that going side-ways... let me poke at that. > > > > I suspect this will do. > > > > diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c > > index 889aa602f8d8..7cfe88e30b52 100644 > > --- a/arch/sparc/prom/p1275.c > > +++ b/arch/sparc/prom/p1275.c > > @@ -38,7 +38,7 @@ void p1275_cmd_direct(unsigned long *args) > > unsigned long flags; > > > > local_save_flags(flags); > > - local_irq_restore((unsigned long)PIL_NMI); > > + arch_local_irq_restore((unsigned long)PIL_NMI); > > raw_spin_lock(&prom_entry_lock); > > > > prom_world(1); > > > No, that doesn't help. Even removing that line entirely doesn't help. > The problem seems to be that interrupts are not enabled in the first > place. But why wasn't this a problem before ? Previously every interrupt opt would disable/enable things, now we only update state when something actually changes. Anyway, I'm struggling with qemu-system-sparc64, I've got a sparc64 cross booting to mount, but I'm not seeing this, could you get me your specific qemu cmdline please?
[PATCH] s390/idle: Fix suspicious RCU usage
After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the lock tracepoints are visible to lockdep and RCU-lockdep is finding a bunch more RCU violations that were previously hidden. Switch the idle->seqcount over to using raw_write_*() to avoid the lockdep annotation and thus the lock tracepoints. Reported-by: Guenter Roeck Signed-off-by: Peter Zijlstra (Intel) --- arch/s390/kernel/idle.c |5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) --- a/arch/s390/kernel/idle.c +++ b/arch/s390/kernel/idle.c @@ -39,14 +39,13 @@ void enabled_wait(void) local_irq_restore(flags); /* Account time spent with enabled wait psw loaded as idle time. */ - /* XXX seqcount has tracepoints that require RCU */ - write_seqcount_begin(&idle->seqcount); + raw_write_seqcount_begin(&idle->seqcount); idle_time = idle->clock_idle_exit - idle->clock_idle_enter; idle->clock_idle_enter = idle->clock_idle_exit = 0ULL; idle->idle_time += idle_time; idle->idle_count++; account_idle_time(cputime_to_nsecs(idle_time)); - write_seqcount_end(&idle->seqcount); + raw_write_seqcount_end(&idle->seqcount); } NOKPROBE_SYMBOL(enabled_wait);
Re: [PATCH 00/12] x86/platform/uv: Updates for UV5
On Tue, Sep 08, 2020 at 08:28:16AM -0700, Mike Travis wrote: > I didn't. If I could figure out how to convert quilt patches into git > commits I might be able to do that? (And I didn't know that diffstats were > needed on the into?) $ git quiltimport Or, for the more enterprising person: $ quilt series | while read file; do git am $file; done Generating a diffstat from a quilt series (when applied): $ quilt diff --combine - | diffstat
[PATCH] sparc64: Fix irqtrace warnings on Ultra-S
On Tue, Sep 08, 2020 at 07:40:23AM -0700, Guenter Roeck wrote: > qemu-system-sparc64 -M sun4u -cpu "TI UltraSparc IIi" -m 512 \ > -initrd rootfs.cpio \ > -kernel arch/sparc/boot/image -no-reboot \ > -append "panic=-1 slub_debug=FZPUA rdinit=/sbin/init console=ttyS0" \ > -nographic -monitor none Thanks I got it. Also enabling DEBUG_LOCKDEP helps (-: --- Subject: sparc64: Fix irqtrace warnings on Ultra-S Recent changes in Lockdep's IRQTRACE broke Ultra-S. In order avoid redundant IRQ state changes, local_irq_restore() lost the ability to trace a disable. Change the code to use local_irq_save() to disable IRQs and then use arch_local_irq_restore() to further disable NMIs. This result in slightly suboptimal code, but given this code uses a global spinlock, performance cannot be its primary purpose. Fixes: 044d0d6de9f5 ("lockdep: Only trace IRQ edges") Reported-by: Guenter Roeck Signed-off-by: Peter Zijlstra (Intel) --- A possible alternative would be: local_save_flags(flags); arch_local_irq_restore((unsigned long)PIL_NMI); if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) trace_hardirqs_off(); which generates optimal code, but is more verbose. arch/sparc/prom/p1275.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c index 889aa602f8d8..e22233fcf741 100644 --- a/arch/sparc/prom/p1275.c +++ b/arch/sparc/prom/p1275.c @@ -37,16 +37,15 @@ void p1275_cmd_direct(unsigned long *args) { unsigned long flags; - local_save_flags(flags); - local_irq_restore((unsigned long)PIL_NMI); + local_irq_save(flags); + arch_local_irq_restore((unsigned long)PIL_NMI); raw_spin_lock(&prom_entry_lock); prom_world(1); prom_cif_direct(args); prom_world(0); - raw_spin_unlock(&prom_entry_lock); - local_irq_restore(flags); + raw_spin_unlock_irqrestore(&prom_entry_lock, flags); } void prom_cif_init(void *cif_handler, void *cif_stack)
Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On Mon, Sep 07, 2020 at 06:01:15PM +0200, pet...@infradead.org wrote: > On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com wrote: > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > > index 0f3d01562ded..fa08d810dcd2 100644 > > --- a/arch/x86/events/core.c > > +++ b/arch/x86/events/core.c > > @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event *event, > > int flags) > > > > cpuc->events[idx] = event; > > __set_bit(idx, cpuc->active_mask); > > - __set_bit(idx, cpuc->running); > > + /* The cpuc->running is only used by the P4 PMU */ > > + if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON) && > > + (boot_cpu_data.x86 == 0xf)) > > + __set_bit(idx, cpuc->running); > > x86_pmu.enable(event); > > perf_event_update_userpage(event); > > } > > Yuck! Use a static_branch() or something. This is a gnarly nest of code > that runs 99.9% of the time to conclude a negative. IOW a complete waste > of cycles for everybody not running a P4 space heater. Better still, move it into p4_pmu_enable_event().
Re: [PATCH -next] fork: silence a false positive warning in __mmdrop
On Tue, Sep 08, 2020 at 12:50:44PM -0400, Qian Cai wrote: > > No, you're talking nonsense. We must not free @mm when > > 'current->active_mm == mm', never. > > Yes, you are right. It still trigger this below on powerpc with today's > linux-next by fuzzing for a while (saw a few times on recent linux-next before > as well but so far mostly reproducible on powerpc here). Any idea? If you can reliably reproduce this, the next thing is to trace mm_count and figure out where it goes side-ways. I suppose we're looking for an 'extra' decrement. Mark tried this for a while but gave up because he couldn't reliably reproduce.
Re: [PATCH v2 4/5] seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
On Fri, Sep 04, 2020 at 05:32:30PM +0200, Ahmed S. Darwish wrote: > @@ -406,13 +443,20 @@ static inline int read_seqcount_t_retry(const > seqcount_t *s, unsigned start) > return __read_seqcount_t_retry(s, start); > } > > +/* > + * Enforce non-preemptibility for all seqcount_LOCKNAME_t writers. Don't > + * do it for PREEMPT_RT, for the reasons outlined at __SEQ_LOCK(). > + */ > +#define __seq_enforce_writer_non_preemptibility(s) \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) > + > /** > * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep > * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants > */ > #define raw_write_seqcount_begin(s) \ > do { \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ > @@ -433,7 +477,7 @@ static inline void raw_write_seqcount_t_begin(seqcount_t > *s) > do { \ > raw_write_seqcount_t_end(__seqcount_ptr(s));\ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_enable(); \ > } while (0) > > @@ -456,7 +500,7 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s) > do { \ > __seqcount_assert_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ > @@ -483,7 +527,7 @@ static inline void > write_seqcount_t_begin_nested(seqcount_t *s, int subclass) > do { \ > __seqcount_assert_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seq_enforce_writer_non_preemptibility(s)) \ > preempt_disable(); \ > \ > write_seqcount_t_begin(__seqcount_ptr(s)); \ > @@ -504,7 +548,7 @@ static inline void write_seqcount_t_begin(seqcount_t *s) > do { \ > write_seqcount_t_end(__seqcount_ptr(s));\ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if 
(__seq_enforce_writer_non_preemptibility(s)) \ > preempt_enable(); \ > } while (0) I've replaced the above with the below, afaict there were no users of __seqcount_lock_preemptible() left. --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -228,7 +228,11 @@ __seqprop_##lockname##_sequence(const se static __always_inline bool\ __seqprop_##lockname##_preemptible(const seqcount_##lockname##_t *s) \ { \ - return preemptible; \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) \ + return preemptible; \ + \ + /* PREEMPT_RT relies on the above LOCK+UNLOCK */\ + return false; \ } \ \ static __always_inline void\
Re: [PATCH v2 2/5] seqlock: Use unique prefix for seqcount_t property accessors
On Fri, Sep 04, 2020 at 05:32:28PM +0200, Ahmed S. Darwish wrote: > static __always_inline seqcount_t * \ > -__seqcount_##lockname##_ptr(seqcount_##lockname##_t *s) > \ > +__seqprop_seqcount_##lockname##_ptr(seqcount_##lockname##_t *s) > \ I did s/__seqprop_seqcount_/__seqprop_/g on all this. seqprop is a contraction of seqcount property, no need to put in yet another seqcount I feel.
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Tue, Sep 08, 2020 at 11:15:14AM +, eddy...@trendmicro.com wrote: > > From: pet...@infradead.org > > > > I'm now trying and failing to reproduce I can't seem to make it use > > int3 today. It seems to want to use ftrace or refuses everything. I'm > > probably doing it wrong. > > > > You can turn off CONFIG_KPROBES_ON_FTRACE (and also sysctl > debug.kprobes-optimization) to make it use int3 handler I'm fairly sure I used the sysctl like in your original reproducer. I'll try again later.
Re: linux-next: build failure after merge of the tip tree
On Tue, Sep 08, 2020 at 07:12:23PM +1000, Stephen Rothwell wrote: > Hi all, > > After merging the tip tree, today's linux-next build (powerpc > allyesconfig) failed like this: > > ERROR: modpost: too long symbol > ".__tracepoint_iter_pnfs_mds_fallback_pg_get_mirror_count" > [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] > > Caused by commit > > d25e37d89dd2 ("tracepoint: Optimize using static_call()") > > Exported symbols need to be <= (64 - sizeof(Elf_Addr)) long. This is > presumably 56 on 64 bit arches and the above symbol (including the '.') > is 56 characters long. I suppose something like the below ought to cure that. Still, stupid long tracename that. --- diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 3722a10fc46d..81fa0b2f271e 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -154,7 +154,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) #ifdef CONFIG_HAVE_STATIC_CALL #define __DO_TRACE_CALL(name) static_call(tp_func_##name) #else -#define __DO_TRACE_CALL(name) __tracepoint_iter_##name +#define __DO_TRACE_CALL(name) __traceiter_##name #endif /* CONFIG_HAVE_STATIC_CALL */ /* @@ -232,8 +232,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) * poking RCU a bit. 
*/ #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \ - extern int __tracepoint_iter_##name(data_proto);\ - DECLARE_STATIC_CALL(tp_func_##name, __tracepoint_iter_##name); \ + extern int __traceiter_##name(data_proto); \ + DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name);\ extern struct tracepoint __tracepoint_##name; \ static inline void trace_##name(proto) \ { \ @@ -288,19 +288,19 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) static const char __tpstrtab_##_name[] \ __section(__tracepoints_strings) = #_name; \ extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \ - int __tracepoint_iter_##_name(void *__data, proto); \ + int __traceiter_##_name(void *__data, proto); \ struct tracepoint __tracepoint_##_name __used \ __section(__tracepoints) = {\ .name = __tpstrtab_##_name, \ .key = STATIC_KEY_INIT_FALSE, \ .static_call_key = &STATIC_CALL_KEY(tp_func_##_name), \ .static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \ - .iterator = &__tracepoint_iter_##_name, \ + .iterator = &__traceiter_##_name, \ .regfunc = _reg,\ .unregfunc = _unreg,\ .funcs = NULL };\ __TRACEPOINT_ENTRY(_name); \ - int __tracepoint_iter_##_name(void *__data, proto) \ + int __traceiter_##_name(void *__data, proto)\ { \ struct tracepoint_func *it_func_ptr;\ void *it_func; \ @@ -314,18 +314,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p) } while ((++it_func_ptr)->func);\ return 0; \ } \ - DEFINE_STATIC_CALL(tp_func_##_name, __tracepoint_iter_##_name); + DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name); #define DEFINE_TRACE(name, proto, args)\ DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args)); #define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \ EXPORT_SYMBOL_GPL(__tracepoint_##name); \ - EXPORT_SYMBOL_GPL(__tracepoint_iter_##name);\ + EXPORT_SYMBOL_GPL(__traceiter_##name); \ EXPORT_STATIC_CALL_GPL(tp_func_##name) #define EXPORT_TRACEPOINT_SYMBOL(name) \ 
EXPORT_SYMBOL(__tracepoint_##name); \ - EXPORT_SYMBOL(__tracepoint_iter_##name);\ + EXPORT_SYMBOL(__traceiter_##name); \ EXPORT_STATIC_CALL(tp_func_##name)
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Thu, Sep 03, 2020 at 10:39:54AM +0900, Masami Hiramatsu wrote: > > There's a bug, that might make it miss it. I have a patch. I'll send it > > shortly. > > OK, I've confirmed that the lockdep warns on kretprobe from INT3 > with your fix. I'm now trying and failing to reproduce I can't seem to make it use int3 today. It seems to want to use ftrace or refuses everything. I'm probably doing it wrong. Could you verify the below actually works? It's on top of the first 16 patches. > Of course make it lockless then warning is gone. > But even without the lockless patch, this warning can be false-positive > because we prohibit nested kprobe call, right? Yes, because the actual nesting is avoided by kprobe_busy, but lockdep can't tell. Lockdep sees a regular lock user and an in-nmi lock user and figures that's a bad combination. --- --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1241,48 +1241,47 @@ void recycle_rp_inst(struct kretprobe_in } NOKPROBE_SYMBOL(recycle_rp_inst); -void kretprobe_hash_lock(struct task_struct *tsk, -struct hlist_head **head, unsigned long *flags) -__acquires(hlist_lock) -{ - unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - raw_spinlock_t *hlist_lock; - - *head = &kretprobe_inst_table[hash]; - hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_lock_irqsave(hlist_lock, *flags); -} -NOKPROBE_SYMBOL(kretprobe_hash_lock); - static void kretprobe_table_lock(unsigned long hash, unsigned long *flags) __acquires(hlist_lock) { raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_lock_irqsave(hlist_lock, *flags); + /* +* HACK, due to kprobe_busy we'll never actually recurse, make NMI +* context use a different lock class to avoid it reporting. 
+*/ + raw_spin_lock_irqsave_nested(hlist_lock, *flags, !!in_nmi()); } NOKPROBE_SYMBOL(kretprobe_table_lock); -void kretprobe_hash_unlock(struct task_struct *tsk, - unsigned long *flags) +static void kretprobe_table_unlock(unsigned long hash, + unsigned long *flags) __releases(hlist_lock) { + raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); + raw_spin_unlock_irqrestore(hlist_lock, *flags); +} +NOKPROBE_SYMBOL(kretprobe_table_unlock); + +void kretprobe_hash_lock(struct task_struct *tsk, +struct hlist_head **head, unsigned long *flags) +__acquires(hlist_lock) +{ unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - raw_spinlock_t *hlist_lock; - hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_unlock_irqrestore(hlist_lock, *flags); + *head = &kretprobe_inst_table[hash]; + kretprobe_table_lock(hash, flags); } -NOKPROBE_SYMBOL(kretprobe_hash_unlock); +NOKPROBE_SYMBOL(kretprobe_hash_lock); -static void kretprobe_table_unlock(unsigned long hash, - unsigned long *flags) +void kretprobe_hash_unlock(struct task_struct *tsk, + unsigned long *flags) __releases(hlist_lock) { - raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); - raw_spin_unlock_irqrestore(hlist_lock, *flags); + unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); + kretprobe_table_unlock(hash, flags); } -NOKPROBE_SYMBOL(kretprobe_table_unlock); +NOKPROBE_SYMBOL(kretprobe_hash_unlock); struct kprobe kprobe_busy = { .addr = (void *) get_kprobe,
Re: [RFC PATCH] sched: only issue an audit on privileged operation
On Fri, Sep 04, 2020 at 06:00:31PM +0200, Christian Göttsche wrote:
> sched_setattr(2) does via kernel/sched/core.c:__sched_setscheduler()
> issue a CAP_SYS_NICE audit event unconditionally, even when the requested
> operation does not require that capability / is un-privileged.
>
> Perform privileged/unprivileged categorization first and perform a
> capable test only if needed.
>
> Signed-off-by: Christian Göttsche
> ---
>  kernel/sched/core.c | 65 -
>  1 file changed, 47 insertions(+), 18 deletions(-)

So who sodding cares about audit, and why is that a reason to make a
trainwreck of code?
Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity
On Mon, Sep 07, 2020 at 11:48:45AM +0100, Qais Yousef wrote:
> IMHO the above is a hack. Out-of-tree modules should rely on public headers
> and exported functions only. What you propose means that people who want to
> use these tracepoints in meaningful way must have a prebuilt kernel handy.
> Which is maybe true for us who work in the embedded world. But users who run
> normal distro kernels (desktop/servers) will fail to build against

But this isn't really aimed at regular users. We're aiming this at
developers (IIUC) so I don't really see this as a problem.

> FWIW, I did raise this concern with Peter in 2019 OSPM and he was okay with
> the exports as it's still not a contract and they can disappear anytime we
> want. Migrating to using BTF is the right way forward IMO. I don't think
> what we have here is out-of-control yet. Though I agree they're annoying.

Right, we're hiding behind the explicit lack of ABI for modules.

Anyway, CTF/BTF/random other crap that isn't DWARFs should work fine to
replace all this muck. Just no idea what the state of any of that is.
Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Mon, Sep 07, 2020 at 06:29:13PM +0200, Ahmed S. Darwish wrote:
> I've been unsuccessful in reproducing this huge, 200+ bytes, difference.
> Can I please get the defconfig and GCC version?

I think I lost the config and it's either gcc-9.3 or gcc-10, I can't
remember.

I just tried with:

  make defconfig
  ./scripts/config --enable PREEMPT --enable DEBUG_ATOMIC_SLEEP
  make oldconfig

And that reproduces things a little, but nowhere near as horrible as I
reported. Clearly I had something mad enabled by accident.

> Here are the two competing implementations:
>
> noinline void cyc2ns_read_begin_v1(struct cyc2ns_data *data)
> {
>	seqcount_latch_t *seqcount;
>	int seq, idx;
>
>	preempt_disable_notrace();
>
>	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
>	do {
>		seq = raw_read_seqcount_latch(seqcount);
>		idx = seq & 1;
>
>		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
>	} while (read_seqcount_latch_retry(seqcount, seq));
> }
>
> noinline void cyc2ns_read_begin_v2(struct cyc2ns_data *data)
> {
>	int seq, idx;
>
>	preempt_disable_notrace();
>
>	do {
>		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
>		idx = seq & 1;
>
>		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
>	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }

Don't look at this function in isolation, look at native_sched_clock()
where it's used as a whole.

What happened (afaict) is that the change caused it to use more
registers and ended up spilling crap on the stack.
GCC-9.3 gives me:

(this_cpu variant)

0c00 :
      c00:	e9 65 00 00 00       	jmpq   c6a
0005  c05:	0f 31                	rdtsc
0007  c07:	48 c1 e2 20          	shl    $0x20,%rdx
000b  c0b:	48 09 c2             	or     %rax,%rdx
000e  c0e:	65 ff 05 00 00 00 00 	incl   %gs:0x0(%rip)        # c15
0011  c11: R_X86_64_PC32	__preempt_count-0x4
0015  c15:	65 44 8b 05 00 00 00 	mov    %gs:0x0(%rip),%r8d   # c1d
001c  c1c:	00
0019  c19: R_X86_64_PC32	.data..percpu..shared_aligned+0x1c
001d  c1d:	44 89 c0             	mov    %r8d,%eax
0020  c20:	83 e0 01             	and    $0x1,%eax
0023  c23:	48 c1 e0 04          	shl    $0x4,%rax
0027  c27:	48 8d 88 00 00 00 00 	lea    0x0(%rax),%rcx
002a  c2a: R_X86_64_32S	.data..percpu..shared_aligned
002e  c2e:	65 48 8b 79 08       	mov    %gs:0x8(%rcx),%rdi
0033  c33:	65 8b b0 00 00 00 00 	mov    %gs:0x0(%rax),%esi
0036  c36: R_X86_64_32S	.data..percpu..shared_aligned
003a  c3a:	65 8b 49 04          	mov    %gs:0x4(%rcx),%ecx
003e  c3e:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax   # c45
0041  c41: R_X86_64_PC32	.data..percpu..shared_aligned+0x1c
0045  c45:	41 39 c0             	cmp    %eax,%r8d
0048  c48:	75 cb                	jne    c15
004a  c4a:	89 f0                	mov    %esi,%eax
004c  c4c:	48 f7 e2             	mul    %rdx
004f  c4f:	48 0f ad d0          	shrd   %cl,%rdx,%rax
0053  c53:	48 d3 ea             	shr    %cl,%rdx
0056  c56:	f6 c1 40             	test   $0x40,%cl
0059  c59:	48 0f 45 c2          	cmovne %rdx,%rax
005d  c5d:	48 01 f8             	add    %rdi,%rax
0060  c60:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # c67
0063  c63: R_X86_64_PC32	__preempt_count-0x4
0067  c67:	74 1a                	je     c83
0069  c69:	c3                   	retq
006a  c6a:	48 69 05 00 00 00 00 	imul   $0xf4240,0x0(%rip),%rax # c75
0071  c71:	40 42 0f 00
006d  c6d: R_X86_64_PC32	jiffies_64-0x8
0075  c75:	48 ba 00 b8 64 d9 05 	movabs $0xfff0be05d964b800,%rdx
007c  c7c:	be f0 ff
007f  c7f:	48 01 d0             	add    %rdx,%rax
0082  c82:	c3                   	retq
0083  c83:	55                   	push   %rbp
0084  c84:	48 89 e5             	mov    %rsp,%rbp
0087  c87:	e8 00 00 00 00       	callq  c8c
0088  c88: R_X86_64_PLT32	preempt_schedule_notrace_thunk-0x4
008c  c8c:	5d                   	pop    %rbp
008d  c8d:	c3                   	retq

(seqcount_latch variant)

0c20 :
      c20:	e9 89 00 00 00       	jmpq   cae
0005  c25:	55
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Mon, Sep 07, 2020 at 02:03:09PM +0200, Joerg Vehlow wrote:
>
> On 9/7/2020 1:46 PM, pet...@infradead.org wrote:
> > I think it's too complicated for what is needed, did you see my
> > suggestion from a year ago? Did I miss something obvious?
>
> This one?
> https://lore.kernel.org/linux-fsdevel/20191219090535.gv2...@hirez.programming.kicks-ass.net/
>
> I think it may be a bit incorrect?
> According to the original comment in __crash_kexec, the mutex was used to
> prevent a sys_kexec_load, while crash_kexec is executed. Your proposed patch
> does not lock the mutex in crash_kexec.

Sure, but any mutex taker will (spin) wait for panic_cpu == CPU_INVALID.

And if the mutex is already held, we'll not run __crash_kexec(), just
like the trylock() would do today.

> This does not cover the original use
> case anymore. The only thing that is protected now are two panicking cores
> at the same time.

I'm not following. AFAICT it does exactly what the old code did.

Although maybe I didn't replace all kexec_mutex users, I now see that
thing isn't static.

> Actually, this implementation feels even more hacky to me

It's more minimal ;-)

It's simpler in that it only provides the required semantics (as I
understand them) and does not attempt to implement a more general
trylock() like primitive that isn't needed.

Also, read the kexec_lock() implementation you posted and explain to me
what happens when kexec_busy is elevated.

Also note the lack of confusing loops in my code.
Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 0f3d01562ded..fa08d810dcd2 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event *event, int flags)
>
>  	cpuc->events[idx] = event;
>  	__set_bit(idx, cpuc->active_mask);
> -	__set_bit(idx, cpuc->running);
> +	/* The cpuc->running is only used by the P4 PMU */
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON) &&
> +	    (boot_cpu_data.x86 == 0xf))
> +		__set_bit(idx, cpuc->running);
>  	x86_pmu.enable(event);
>  	perf_event_update_userpage(event);
>  }

Yuck! Use a static_branch() or something. This is a gnarly nest of code
that runs 99.9% of the time to conclude a negative. IOW a complete waste
of cycles for everybody not running a P4 space heater.

> @@ -1544,6 +1547,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>  	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
>  		goto do_del;
>
> +	if (READ_ONCE(x86_pmu.attr_rdpmc) && x86_pmu.sched_task &&
> +	    test_bit(event->hw.idx, cpuc->active_mask))
> +		__set_bit(event->hw.idx, cpuc->dirty);

And that too seems like an overly complicated set of tests and
branches. This should be effectively true for the 99% common case.

> @@ -2219,11 +2225,45 @@ static int x86_pmu_event_init(struct perf_event *event)
>  	return err;
>  }
>
> +void x86_pmu_clear_dirty_counters(void)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	int i;
> +
> +	if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
> +		return;
> +
> +	/* Don't need to clear the assigned counter. */
> +	for (i = 0; i < cpuc->n_events; i++)
> +		__clear_bit(cpuc->assign[i], cpuc->dirty);
> +
> +	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
> +		/* Metrics events don't have corresponding HW counters. */
> +		if (is_metric_idx(i))
> +			continue;
> +		else if (i >= INTEL_PMC_IDX_FIXED)
> +			wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
> +		else
> +			wrmsrl(x86_pmu_event_addr(i), 0);
> +	}
> +
> +	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
> +}

The bitmap_{empty,zero}() do indeed compile into single instructions,
neat!

>  static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
>  {
>  	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
>  		return;
>
> +	/*
> +	 * Enable sched_task() for the RDPMC task,
> +	 * and clear the existing dirty counters.
> +	 */
> +	if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event)) {
> +		perf_sched_cb_inc(event->ctx->pmu);
> +		x86_pmu_clear_dirty_counters();
> +	}

I'm failing to see the correctness of the !is_sampling_event() part
there.

>  	/*
>  	 * This function relies on not being called concurrently in two
>  	 * tasks in the same mm. Otherwise one task could observe
> @@ -2246,6 +2286,9 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
>  	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
>  		return;
>
> +	if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event))
> +		perf_sched_cb_dec(event->ctx->pmu);
> +

Idem.

>  	if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
>  		on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
>  }
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c72e4904e056..e67713bfa33a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4166,11 +4166,39 @@ static void intel_pmu_cpu_dead(int cpu)
>  	intel_cpuc_finish(&per_cpu(cpu_hw_events, cpu));
>  }
>
> +static void intel_pmu_rdpmc_sched_task(struct perf_event_context *ctx)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct perf_event *event;
> +
> +	if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
> +		return;
> +
> +	/*
> +	 * If the new task has the RDPMC enabled, clear the dirty counters to
> +	 * prevent the potential leak. If the new task doesn't have the RDPMC
> +	 * enabled, do nothing.
> +	 */
> +	list_for_each_entry(event, &ctx->event_list, event_entry) {
> +		if (event->hw.target &&
> +		    (event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED) &&
> +		    !is_sampling_event(event) &&
> +		    atomic_read(&event->mmap_count))
> +			break;
> +	}
> +	if (&event->event_entry == &ctx->event_list)
> +		return;

That's horrific, what's wrong with something like:

	if (!atomic_re
Re: Requirements to control kernel isolation/nohz_full at runtime
(your mailer broke and forgot to keep lines shorter than 78 chars)

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger syscall slowpath on
> CPUs that don't want nohz_full. Also we don't want to iterate all
> threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does
> that well.

Build on the common entry code I suppose. Then any arch that uses that
gets to have the new features.

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() /
> exception_exit(). This saves the previous state of context tracking
> and switch to kernel mode (from context tracking POV) temporarily.
> Since this state is saved on the stack, this prevents us from turning
> off context tracking entirely on a CPU: The tracking must be done on
> all CPUs and that takes some cycles.
>
> This means that, considering early entry code (before the call to
> context tracking upon kernel entry, and after the call to context
> tracking upon kernel exit), we must take care of few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or if it does,
> the given exception can't schedule or use RCU (unless it calls
> rcu_nmi_enter()). Otherwise the exception must call
> exception_enter()/exception_exit() which we don't want.

I think this is true for x86. Early entry has interrupts disabled, any
exception that can still happen is NMI-like and will thus use
rcu_nmi_enter(). On x86 that now includes #DB (which is also excluded
due to us refusing to set execution breakpoints on entry code), #BP,
NMI and MCE.

> 2) No call to schedule_user().

I'm not sure what that is supposed to do, but x86 doesn't appear to
have it, so all good :-)

> 3) Make sure early entry code is not interruptible or
> preempt_schedule_irq() would rely on
> exception_entry()/exception_exit()

This is so for x86.

> 4) Make sure early entry code can't be traced (no call to
> preempt_schedule_notrace()), or if it does it can't schedule

noinstr is your friend.

> I believe x86 does most of that well.

It does now.
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Mon, Sep 07, 2020 at 12:51:37PM +0200, Joerg Vehlow wrote:
> Hi,
>
> I guess there is currently no other way than to use something like Steven
> proposed. I implemented and tested the attached patch with a module,
> that triggers the soft lockup detection and it works as expected.
> I did not use inline functions, but normal functions implemented in
> kexec_core, because there is nothing time critical here.
> I also added the mutex_lock to the trylock variant, because then the unlock
> function can be the same for both lock functions.
>
> What do you think?

I think it's too complicated for what is needed; did you see my
suggestion from a year ago? Did I miss something obvious?
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Sat, Aug 22, 2020 at 07:49:28PM -0400, Steven Rostedt wrote:
> From this email:
>
> > The problem happens when that owner is the idle task, this can happen
> > when the irq/softirq hits the idle task, in that case the contending
> > mutex_lock() will try and PI boost the idle task, and that is a big
> > no-no.
>
> What's wrong with priority boosting the idle task? It's not obvious,
> and I can't find comments in the code saying it would be bad.
> The idle task is not mentioned at all in rtmutex.c and not mentioned in
> kernel/locking except for some comments about RCU in lockdep.

There used to be a giant BUG() and comment somewhere in the PI code I
think... but that's vague memories.

> I see that in the idle code the prio_change method does a BUG(), but
> there's no comment to say why it does so.
>
> The commit that added that BUG, doesn't explain why it can't happen:
>
> a8941d7ec8167 ("sched: Simplify the idle scheduling class")

That's like almost a decade ago ...

> I may have once known the rationale behind all this, but it's been a
> long time since I worked on the PI code, and it's out of my cache.

I suffer much the same problem.

So conceptually there's the problem that idle must always be runnable,
and the moment you boost it, it becomes subject to a different
scheduling class. Imagine for example what happens when we boost it to
RT and then have it be subject to throttling. What are we going to run
when the idle task is no longer eligible to run?

(it might all work out by accident, but ISTR we had a whole bunch of
fun in the earlier days of RT due to things like that)
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 09:26:04AM +0200, pet...@infradead.org wrote:
> On Thu, Sep 03, 2020 at 11:07:28AM +0900, Masahiro Yamada wrote:
> > Will re-implementing your sorting logic
> > in bash look cleaner?
>
> Possibly, I can try, we'll see.

It is somewhat cleaner, but it is _abysmally_ slow. Bash sucks :-(

It is still broken in all the same ways as before, I figured I'd get it
'working' first.

---
diff --git a/scripts/tags.sh b/scripts/tags.sh
index 32d3f53af10b..ec2688b3441a 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -239,10 +239,65 @@ setup_regex()
 	done
 }
 
+sort_tags()
+{
+	export LC_ALL=C
+
+	# start concurrent sort
+	coproc sort
+	# HACK, clone sort output into 3 to ensure we can still read it
+	# after sort terminates
+	exec 3<&${COPROC[0]}
+
+	while read tag file rest;
+	do
+		local tmp=${rest#*;\"}
+
+		case "${tmp:1:1}" in # Precedence for 'C' kinds
+
+		c) order="A";; # classes
+		s) order="B";; # structure names
+		t) order="C";; # typedefs
+		g) order="D";; # enumeration names
+		u) order="E";; # union names
+		n) order="F";; # namespaces
+
+		f) order="G";; # function definitions
+		p) order="H";; # function prototypes
+		d) order="I";; # macro definitions
+
+		e) order="J";; # enumerators (values inside an enumeration)
+		m) order="K";; # class, struct and union members
+		v) order="L";; # variable definitions
+
+		l) order="M";; # local variables [off]
+		x) order="N";; # external and forward variable declarations
+
+		*) order="Z";;
+
+		esac
+
+		# write to sort with a new sort-key prepended
+		echo "${tag}${order}${tag} ${file} ${rest}" >&${COPROC[1]}
+	done
+
+	# close sort input
+	exec {COPROC[1]}>&-
+
+	# consume sort output
+	while read -u 3 key line;
+	do
+		# strip the sort-key
+		echo "${line}"
+	done
+}
+
 exuberant()
 {
+	(
 	setup_regex exuberant asm c
-	all_target_sources | xargs $1 -a			\
+	all_target_sources | xargs $1				\
 		-I __initdata,__exitdata,__initconst,__ro_after_init	\
 		-I __initdata_memblock					\
 		-I __refdata,__attribute,__maybe_unused,__always_unused \
@@ -256,12 +311,16 @@ exuberant()
 		-I DEFINE_TRACE,EXPORT_TRACEPOINT_SYMBOL,EXPORT_TRACEPOINT_SYMBOL_GPL \
 		-I static,const						\
 		--extra=+fq --c-kinds=+px --fields=+iaS --langmap=c:+.h \
+		--sort=no -o -						\
 		"${regex[@]}"
 
 	setup_regex exuberant kconfig
-	all_kconfigs | xargs $1 -a				\
+	all_kconfigs | xargs $1					\
+		--sort=no -o -					\
 		--langdef=kconfig --language-force=kconfig "${regex[@]}"
+
+	) | sort_tags > tags
+
 }
 
 emacs()
Re: [PATCH] MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
On Fri, Sep 04, 2020 at 11:26:23AM +0200, Daniel Bristot de Oliveira wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 592467ba3f4d..56d185210a43 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -15363,6 +15363,7 @@ R:	Dietmar Eggemann (SCHED_NORMAL)
>  R:	Steven Rostedt (SCHED_FIFO/SCHED_RR)
>  R:	Ben Segall (CONFIG_CFS_BANDWIDTH)
>  R:	Mel Gorman (CONFIG_NUMA_BALANCING)
> +R:	Daniel Bristot de Oliveira (SCHED_DEADLINE)
>  L:	linux-kernel@vger.kernel.org
>  S:	Maintained
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

Absolutely, great to have you looking over that!
Re: [PATCH v2 05/28] objtool: Add a pass for generating __mcount_loc
On Thu, Sep 03, 2020 at 03:03:30PM -0700, Sami Tolvanen wrote:
> On Thu, Sep 3, 2020 at 2:51 PM Kees Cook wrote:
> >
> > On Thu, Sep 03, 2020 at 01:30:30PM -0700, Sami Tolvanen wrote:
> > > From: Peter Zijlstra
> > >
> > > Add the --mcount option for generating __mcount_loc sections
> > > needed for dynamic ftrace. Using this pass requires the kernel to
> > > be compiled with -mfentry and CC_USING_NOP_MCOUNT to be defined
> > > in Makefile.
> > >
> > > Link: https://lore.kernel.org/lkml/20200625200235.gq4...@hirez.programming.kicks-ass.net/
> > > Signed-off-by: Peter Zijlstra
> >
> > Hmm, I'm not sure why this hasn't gotten picked up yet. Is this expected
> > to go through -tip or something else?
>
> Note that I picked up this patch from Peter's original email, to which
> I included a link in the commit message, but it wasn't officially
> submitted as a patch. However, the previous discussion seems to have
> died, so I included the patch in this series, as it cleanly solves the
> problem of whitelisting non-call references to __fentry__. I was
> hoping for Peter and Steven to comment on how they prefer to proceed
> here.

Right; so I'm obviously fine with this patch and I suppose I can pick
it (and the next) into tip/objtool/core, provided Steve is okay with
this approach.
Re: [PATCH v2 00/28] Add support for Clang LTO
Please don't nest series! Start a new thread for every posting.