[PATCH v6] hrtimer: avoid retrigger_next_event IPI

2021-04-19 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v6:
  - Do not take softirq_raised into account (Peter Xu).
  - Include BOOTTIME as base that requires IPI (Thomas).
  - Unconditional reprogram on resume path, since there is
nothing to gain in such path anyway.

v5:
  - Add missing hrtimer_update_base (Peter Xu).

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).


diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index bb5e7b0a4274..14a6e449b221 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -318,7 +318,7 @@ struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern void clock_was_set_delayed(void);
+extern void clock_was_set_delayed(bool force_reprogram);
 
 extern unsigned int hrtimer_resolution;
 
@@ -326,7 +326,7 @@ extern unsigned int hrtimer_resolution;
 
 #define hrtimer_resolution (unsigned int)LOW_RES_NSEC
 
-static inline void clock_was_set_delayed(void) { }
+static inline void clock_was_set_delayed(bool force_reprogram) { }
 
 #endif
 
@@ -351,7 +351,7 @@ hrtimer_expires_remaining_adjusted(const struct hrtimer 
*timer)
timer->base->get_time());
 }
 
-extern void clock_was_set(void);
+extern void clock_was_set(bool);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..2258782fd714 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -758,9 +758,17 @@ static void hrtimer_switch_to_hres(void)
retrigger_next_event(NULL);
 }
 
+static void clock_was_set_force_reprogram_work(struct work_struct *work)
+{
+   clock_was_set(true);
+}
+
+static DECLARE_WORK(hrtimer_force_reprogram_work, 
clock_was_set_force_reprogram_work);
+
+
 static void clock_was_set_work(struct work_struct *work)
 {
-   clock_was_set();
+   clock_was_set(false);
 }
 
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
@@ -769,9 +777,12 @@ static DECLARE_WORK(hrtimer_work, clock_was_set_work);
  * Called from timekeeping and resume code to reprogram the hrtimer
  * interrupt device on all cpus.
  */
-void clock_was_set_delayed(void)
+void clock_was_set_delayed(bool force_reprogram)
 {
-   schedule_work(&hrtimer_work);
+   if (force_reprogram)
+   schedule_work(&hrtimer_force_reprogram_work);
+   else
+   schedule_work(&hrtimer_work);
 }
 
 #else
@@ -871,6 +882,18 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT) |\
+(1U << HRTIMER_BASE_BOOTTIME) |\
+(1U << HRTIMER_BASE_BOOTTIME_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -882,11 +905,42 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
  * resolution timer interrupts. On UP we just disable interrupts and
  * call the high resolution interrupt code.
  */
-void clock_was_set(void)
+void clock_was_set(bool force_reprogram)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (force_reprogram == true) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+  
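
To make the decision concrete, here is a minimal, self-contained userspace
sketch (plain C, not kernel code) of the per-CPU check the patch performs in
clock_was_set(): if a CPU has no armed timers on any base whose offset changed
(REALTIME, TAI, BOOTTIME and their _SOFT variants), the offsets can be updated
remotely and the IPI skipped; otherwise the CPU goes into the mask that
receives retrigger_next_event(). The enum values and struct layout below are
illustrative only, not the kernel's.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative clock base indices, mirroring the hrtimer bases by name only. */
enum {
    BASE_MONOTONIC,
    BASE_REALTIME,
    BASE_BOOTTIME,
    BASE_TAI,
    BASE_MONOTONIC_SOFT,
    BASE_REALTIME_SOFT,
    BASE_BOOTTIME_SOFT,
    BASE_TAI_SOFT,
};

/* Bases whose offsets change on clock_was_set() / sleep time injection. */
#define CLOCK_SET_BASES ((1U << BASE_REALTIME)      | \
                         (1U << BASE_REALTIME_SOFT) | \
                         (1U << BASE_TAI)           | \
                         (1U << BASE_TAI_SOFT)      | \
                         (1U << BASE_BOOTTIME)      | \
                         (1U << BASE_BOOTTIME_SOFT))

struct cpu_base {
    unsigned int active_bases;  /* bitmask of bases with armed timers */
};

/* Same shape as the patch's need_reprogram_timer(): only CPUs with armed
 * timers on an affected base need the retrigger IPI. */
static bool need_reprogram_timer(const struct cpu_base *cb)
{
    return (cb->active_bases & CLOCK_SET_BASES) != 0;
}

int main(void)
{
    struct cpu_base cpus[] = {
        { .active_bases = 1U << BASE_MONOTONIC },   /* only monotonic timers */
        { .active_bases = 1U << BASE_REALTIME },    /* armed realtime timer */
        { .active_bases = 0 },                      /* no timers at all */
    };

    for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
        printf("cpu%u: %s\n", i,
               need_reprogram_timer(&cpus[i]) ?
               "send IPI (retrigger_next_event)" :
               "update offsets remotely, skip IPI");
    return 0;
}

With the data above, cpu0 and cpu2 get their offsets updated remotely and only
cpu1 is interrupted.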

Re: [PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-19 Thread Marcelo Tosatti
On Sat, Apr 17, 2021 at 06:51:08PM +0200, Thomas Gleixner wrote:
> On Sat, Apr 17 2021 at 18:24, Thomas Gleixner wrote:
> > On Fri, Apr 16 2021 at 13:13, Peter Xu wrote:
> >> On Fri, Apr 16, 2021 at 01:00:23PM -0300, Marcelo Tosatti wrote:
> >>>  
> >>> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) | \
> >>> +  (1U << HRTIMER_BASE_REALTIME_SOFT) |   \
> >>> +  (1U << HRTIMER_BASE_TAI) | \
> >>> +  (1U << HRTIMER_BASE_TAI_SOFT))
> >>> +
> >>> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> >>> +{
> >>> + if (cpu_base->softirq_activated)
> >>> + return true;
> >>
> >> A pure question on whether this check is needed...
> >>
> >> Here even if softirq_activated==1 (as softirq is going to happen), as long 
> >> as
> >> (cpu_base->active_bases & CLOCK_SET_BASES)==0, shouldn't it already mean 
> >> that
> >> "yes indeed clock was set, but no need to kick this cpu as no relevant 
> >> timer"?
> >> As that question seems to be orthogonal to whether a softirq is going to
> >> trigger on that cpu.
> >
> > That's correct and it's not any different from firing the IPI because in
> > both cases the update happens with the base lock of the CPU in question
> > held. And if there are no active timers in any of the affected bases,
> > then there is no need to reevaluate the next expiry because the offset
> > update does not affect any armed timers. It just makes sure that the
> > next enqueue of a timer on such a base will see the correct offset.
> >
> > I'll just zap it.
> 
> But the whole thing is still wrong in two aspects:
> 
> 1) BOOTTIME can be one of the affected clocks when sleep time
>(suspended time) is injected because that uses the same mechanism.
> 
>Sorry for missing that earlier when I asked to remove it, but
>that's trivial to fix by adding the BOOTTIME base back.
> 
> 2) What's worse is that on resume this might break because that
>mechanism is also used to enforce the reprogramming of the clock
>event devices and there we cannot be selective on clock bases.
> 
>I need to dig deeper into that because suspend/resume has changed
>a lot over time, so this might be just a historical leftover. But
>without proper analysis we might end up with subtle and hard to
>debug wreckage.
> 
> Thanks,
> 
> tglx

Thomas,

There is no gain in avoiding the IPIs for the suspend/resume case 
(since suspending is a large interruption anyway). To avoid 
the potential complexity (and associated bugs), one option would 
be to NOT skip IPIs for the resume case.

Sending -v6 with that (and other suggestions/fixes).



[PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-16 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v5:
  - Add missing hrtimer_update_base (Peter Xu).

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).


diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..06fcc272e28d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,19 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   if (cpu_base->softirq_activated)
+   return true;
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +898,34 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   else
+   hrtimer_update_base(cpu_base);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }



[PATCH v4] hrtimer: avoid retrigger_next_event IPI

2021-04-15 Thread Marcelo Tosatti
Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..e228c0a0c98f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,19 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   if (cpu_base->softirq_activated)
+   return true;
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +898,32 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }



[PATCH v3] hrtimer: avoid retrigger_next_event IPI

2021-04-15 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..dd9c0d2f469f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,24 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (cpu_base->softirq_activated)
+   return true;
+
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +903,32 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }






[PATCH v2] hrtimer: avoid retrigger_next_event IPI

2021-04-13 Thread Marcelo Tosatti



Setting the realtime clock triggers an IPI to all CPUs to reprogram
hrtimers.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

For each CPU, check whether it only has active timers on the monotonic bases,
and in that case update the realtime and TAI base offsets remotely, skipping the IPI.

This reduces interruptions to latency sensitive applications.

Signed-off-by: Marcelo Tosatti 

---

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).
   

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..be21b85c679d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
+(1U << HRTIMER_BASE_REALTIME_SOFT)|\
+(1U << HRTIMER_BASE_TAI)|  \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (cpu_base->softirq_activated)
+   return true;
+
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   if ((active & CLOCK_SET_BASES) == 0)
+   return false;
+
+   return true;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,9 +907,31 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   preempt_disable();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   free_cpumask_var(mask);
 #endif
+set_timerfd:
timerfd_clock_was_set();
 }



Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-09 Thread Marcelo Tosatti


+CC Anna-Maria.

On Fri, Apr 09, 2021 at 04:15:13PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 07 2021 at 10:53, Marcelo Tosatti wrote:
> > Setting the realtime clock triggers an IPI to all CPUs to reprogram
> > hrtimers.
> >
> > However, only base, boottime and tai clocks have their offsets updated
> 
> base clock? 

Heh...

> And why boottime? Boottime is not affected by a clock
> realtime set. It's clock REALTIME and TAI, nothing else.

OK!

> > +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
> > +(1U << HRTIMER_BASE_REALTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_BOOTTIME)| \
> > +(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_TAI)|  \
> > +(1U << HRTIMER_BASE_TAI_SOFT))
> > +
> > +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> > +{
> > +   unsigned int active = 0;
> > +
> > +   if (!cpu_base->softirq_activated)
> > +   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;

Again, if (cpu_base->softirq_activated), need to IPI (will resend).

> > +   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> > +
> > +   if ((active & CLOCK_SET_BASES) == 0)
> > +   return false;
> > +
> > +   return true;
> > +}
> 
> Errm. 

What?

> > +   /* Avoid interrupting nohz_full CPUs if possible */
> > +   preempt_disable();
> > +   for_each_online_cpu(cpu) {
> > +   if (tick_nohz_full_cpu(cpu)) {
> > +   unsigned long flags;
> > +   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> > +
> > +   raw_spin_lock_irqsave(&cpu_base->lock, flags);
> > +   if (need_reprogram_timer(cpu_base))
> > +   cpumask_set_cpu(cpu, mask);
> > +   else
> > +   hrtimer_update_base(cpu_base);
> > +   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> > +   }
> > +   }
> 
> How is that supposed to be correct?
> 
> CPU0                                CPU1
>
> clock_was_set()                     hrtimer_start(CLOCK_REALTIME)
>
>   if (!active_mask[CPU1] & XXX)
>       continue;
>                                     active_mask |= REALTIME;
>
> ---> fail because that newly started timer is on the old offset.

CPU0                                    CPU1

clock_was_set()

Case-1: CPU-1 grabs base->lock before CPU-0:
        CPU-0 sees active_mask[CPU1] and IPIs.

                                        base = lock_hrtimer_base(timer, &flags);
                                        if (__hrtimer_start_range_ns(timer, tim, ...
                                                hrtimer_reprogram(timer, true);
                                        unlock_hrtimer_base(timer, &flags);

raw_spin_lock_irqsave(&cpu_base->lock, flags);
if (need_reprogram_timer(cpu_base))
        cpumask_set_cpu(cpu, mask);
else
        hrtimer_update_base(cpu_base);
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

Case-2: CPU-1 grabs base->lock after CPU-0:
        CPU-0 will have updated the offsets remotely.

                                        base = lock_hrtimer_base(timer, &flags);
                                        if (__hrtimer_start_range_ns(timer, tim, ...
                                                hrtimer_reprogram(timer, true);
                                        unlock_hrtimer_base(timer, &flags);

No?
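
The argument above relies on cpu_base->lock serializing the remote offset
update against a concurrent hrtimer_start(). A heavily simplified, runnable
userspace analogy (pthreads instead of per-CPU bases, a plain integer standing
in for the base offsets; all names hypothetical, not kernel code) of why either
lock ordering is safe:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model: 'offset' stands in for the REALTIME/TAI base offsets and
 * 'armed' for "this CPU has an active timer on an affected base". */
static pthread_mutex_t base_lock = PTHREAD_MUTEX_INITIALIZER;
static long offset = 100;       /* old offset */
static bool armed;
static bool need_ipi;
static long seen_offset;

/* CPU-0 side: the clock_was_set() per-CPU pass. */
static void *clock_setter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&base_lock);
    if (armed)
        need_ipi = true;        /* Case-1: timer already armed -> IPI */
    else
        offset = 200;           /* Case-2: update offsets remotely */
    pthread_mutex_unlock(&base_lock);
    return NULL;
}

/* CPU-1 side: hrtimer_start(CLOCK_REALTIME) under the same base lock. */
static void *timer_armer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&base_lock);
    armed = true;
    seen_offset = offset;       /* expiry is computed against this offset */
    pthread_mutex_unlock(&base_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, clock_setter, NULL);
    pthread_create(&b, NULL, timer_armer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Either the armer ran first and the setter decided to "IPI" it,
     * or the setter ran first and the armer saw the new offset. */
    printf("need_ipi=%d seen_offset=%ld\n", need_ipi, seen_offset);
    return 0;
}

Whichever side takes base_lock second observes the other's update, which is
the reason the remote offset update is safe when no affected timer is armed.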



Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-08 Thread Marcelo Tosatti
On Thu, Apr 08, 2021 at 12:14:57AM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 07, 2021 at 10:53:01AM -0300, Marcelo Tosatti wrote:
> > 
> > Setting the realtime clock triggers an IPI to all CPUs to reprogram
> > hrtimers.
> > 
> > However, only base, boottime and tai clocks have their offsets updated
> > (and therefore potentially require a reprogram).
> > 
> > If the CPU is a nohz_full one, check if it only has 
> > monotonic active timers, and in that case update the 
> > realtime base offsets, skipping the IPI.
> > 
> > This reduces interruptions to nohz_full CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> > index 743c852e10f2..b42b1a434b22 100644
> > --- a/kernel/time/hrtimer.c
> > +++ b/kernel/time/hrtimer.c
> > @@ -853,6 +853,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> > bool reprogram)
> > tick_program_event(expires, 1);
> >  }
> >  
> > +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
> > +(1U << HRTIMER_BASE_REALTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_BOOTTIME)| \
> > +(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_TAI)|  \
> > +(1U << HRTIMER_BASE_TAI_SOFT))
> > +
> > +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> > +{
> > +   unsigned int active = 0;
> > +
> > +   if (!cpu_base->softirq_activated)
> > +   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;

If cpu_base->softirq_activated == 1, should IPI as well.

> > +   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> > +
> > +   if ((active & CLOCK_SET_BASES) == 0)
> > +   return false;
> > +
> > +   return true;
> > +}
> > +
> >  /*
> >   * Clock realtime was set
> >   *
> > @@ -867,9 +889,41 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> > bool reprogram)
> >  void clock_was_set(void)
> >  {
> >  #ifdef CONFIG_HIGH_RES_TIMERS
> > -   /* Retrigger the CPU local events everywhere */
> > -   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   cpumask_var_t mask;
> > +   int cpu;
> > +
> > +   if (!tick_nohz_full_enabled()) {
> > +   /* Retrigger the CPU local events everywhere */
> > +   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   goto set_timerfd;
> > +   }
> > +
> > +   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> > +   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   goto set_timerfd;
> > +   }
> > +
> > +   /* Avoid interrupting nohz_full CPUs if possible */
> > +   preempt_disable();
> > +   for_each_online_cpu(cpu) {
> > +   if (tick_nohz_full_cpu(cpu)) {
> > +   unsigned long flags;
> > +   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> > +
> > +   raw_spin_lock_irqsave(&cpu_base->lock, flags);
> > +   if (need_reprogram_timer(cpu_base))
> > +   cpumask_set_cpu(cpu, mask);
> > +   else
> > +   hrtimer_update_base(cpu_base);
> > +   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> > +   }
> 
> You forgot to add the housekeeping CPUs to the mask.

So people are using:

console=tty0 console=ttyS0,115200n8 skew_tick=1 nohz=on rcu_nocbs=8-31 
tuned.non_isolcpus=00ff intel_pstate=disable nosoftlockup tsc=nowatchdog 
intel_iommu=on iommu=pt isolcpus=managed_irq,8-31 
systemd.cpu_affinity=0,1,2,3,4,5,6,7 default_hugepagesz=1G hugepagesz=2M 
hugepages=128 nohz_full=8-31

And using the nohz_full= CPUs (or subsets of nohz_full= CPUs) in two modes:

Either "generic non-isolated applications" 
(with load-balancing enabled for those CPUs), or for 
latency sensitive applications. And switching between the modes.

In this case, it would only be possible to check for
housekeeping CPUs of type MANAGED_IRQ, which would be strange.

> As for the need_reprogram_timer() trick, I'll rather defer to Thomas review...
> 
> Thanks.

Thanks!

> 
> > +   }
> > +
> > +   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
> > +   preempt_enable();
> > +   free_cpumask_var(mask);
> >  #endif
> > +set_timerfd:
> > timerfd_clock_was_set();
> >  }
> >  
> > 



Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-08 Thread Marcelo Tosatti
Hi Paolo,

On Thu, Apr 08, 2021 at 10:15:16AM +0200, Paolo Bonzini wrote:
> On 07/04/21 19:40, Marcelo Tosatti wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index fe806e894212..0a83eff40b43 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm 
> > > *kvm)
> > >   kvm_hv_invalidate_tsc_page(kvm);
> > > - spin_lock(&kvm->pvclock_gtod_sync_lock);
> > >   kvm_make_mclock_inprogress_request(kvm);
> > > +
> > Might be good to serialize against two kvm_gen_update_masterclock
> > callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS,
> > while the other is still at pvclock_update_vm_gtod_copy().
> 
> Makes sense, but this stuff has always seemed unnecessarily complicated to
> me.
>
> KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of the
> execution loop; 

We do not want vcpus with different system_timestamp/tsc_timestamp
pair:

 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.

So KVM_REQ_MCLOCK_INPROGRESS also ensures that no vcpu enters 
guest mode (via vcpu->requests check before VM-entry) with a 
different system_timestamp/tsc_timestamp pair.

> clearing it in kvm_gen_update_masterclock is unnecessary,
> because KVM_REQ_CLOCK_UPDATE takes pvclock_gtod_sync_lock too and thus will
> already wait for pvclock_update_vm_gtod_copy to end.
> 
> I think it's possible to use a seqcount in KVM_REQ_CLOCK_UPDATE instead of
> KVM_REQ_MCLOCK_INPROGRESS.  Both cause the vCPUs to spin. I'll take a look.
> 
> Paolo



Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-07 Thread Marcelo Tosatti
On Tue, Mar 30, 2021 at 12:59:57PM -0400, Paolo Bonzini wrote:
> There is no need to include changes to vcpu->requests into
> the pvclock_gtod_sync_lock critical section.  The changes to
> the shared data structures (in pvclock_update_vm_gtod_copy)
> already occur under the lock.
> 
> Cc: David Woodhouse 
> Cc: Marcelo Tosatti 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/x86.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe806e894212..0a83eff40b43 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm 
> *kvm)
>  
>   kvm_hv_invalidate_tsc_page(kvm);
>  
> - spin_lock(&kvm->pvclock_gtod_sync_lock);
>   kvm_make_mclock_inprogress_request(kvm);
> +

Might be good to serialize against two kvm_gen_update_masterclock
callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS,
while the other is still at pvclock_update_vm_gtod_copy().

Otherwise, looks good.



[PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-07 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
hrtimers.

However, only base, boottime and tai clocks have their offsets updated
(and therefore potentially require a reprogram).

If the CPU is a nohz_full one, check if it only has 
monotonic active timers, and in that case update the 
realtime base offsets, skipping the IPI.

This reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..b42b1a434b22 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -853,6 +853,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
+(1U << HRTIMER_BASE_REALTIME_SOFT)|\
+(1U << HRTIMER_BASE_BOOTTIME)| \
+(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
+(1U << HRTIMER_BASE_TAI)|  \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (!cpu_base->softirq_activated)
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   if ((active & CLOCK_SET_BASES) == 0)
+   return false;
+
+   return true;
+}
+
 /*
  * Clock realtime was set
  *
@@ -867,9 +889,41 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!tick_nohz_full_enabled()) {
+   /* Retrigger the CPU local events everywhere */
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting nohz_full CPUs if possible */
+   preempt_disable();
+   for_each_online_cpu(cpu) {
+   if (tick_nohz_full_cpu(cpu)) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   else
+   hrtimer_update_base(cpu_base);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+   }
+
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   free_cpumask_var(mask);
 #endif
+set_timerfd:
timerfd_clock_was_set();
 }
 



Re: [patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-02-12 Thread Marcelo Tosatti
On Fri, Feb 12, 2021 at 01:25:21PM +0100, Frederic Weisbecker wrote:
> On Thu, Jan 28, 2021 at 05:21:36PM -0300, Marcelo Tosatti wrote:
> > Rather than waking up all nohz_full CPUs on the system, only wakeup 
> > the target CPUs of member threads of the signal.
> > 
> > Reduces interruptions to nohz_full CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
> >   * Set a per-taskgroup tick dependency. Posix CPU timers need this in 
> > order to elapse
> >   * per process timers.
> >   */
> > -void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits
> > bit)
> 
> Why not keeping the signal struct as a parameter?
> 
> Thanks.

All callers use "struct signal_struct *sig = tsk->signal" as
signal parameter anyway...

Can change parameters to (task, signal, bit) if you prefer.



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Marcelo Tosatti
On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> > On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> >> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >>>> The whole pile wants to be reverted. It's simply broken in several ways.
> >>> I was asking for your comments on interaction with CPU hotplug :-)
> >> Which I answered in a separate mail :)
> >>
> >>> So housekeeping_cpumask has multiple meanings. In this case:
> >> ...
> >>
> >>> So as long as the meaning of the flags are respected, seems
> >>> alright.
> >> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> >> when an affinity mask spans housekeeping and isolated is perfectly
> >> fine. It's well thought out and has no limitations.
> >>
> >>> Nitesh, is there anything preventing this from being fixed
> >>> in userspace ? (as Thomas suggested previously).
> >> Everything which is not managed can be steered by user space.
> > Yes, but it seems to be racy (that is, there is a window where the 
> > interrupt can be delivered to an isolated CPU).
> >
> > ethtool ->
> > xgbe_set_channels ->
> > xgbe_full_restart_dev ->
> > xgbe_alloc_memory ->
> > xgbe_alloc_channels ->
> > cpumask_local_spread
> >
> > Also ifconfig eth0 down / ifconfig eth0 up leads
> > to cpumask_spread_local.
> 
> There's always that possibility.

Then there is a window where isolation can be broken.

> We have to ensure that we move the IRQs by a tuned daemon or some other
> userspace script every time there is a net-dev change (eg. device comes up,
> creates VFs, etc).

Again, while the race window is open, an interrupt can be delivered to an isolated CPU.

> > How about adding a new flag for isolcpus instead?
> >
> 
> Do you mean a flag based on which we can switch the affinity mask to
> housekeeping for all the devices at the time of IRQ distribution?

Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
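
As a rough illustration of that idea (not an actual kernel patch), a
self-contained userspace sketch of spreading the i-th queue only across CPUs
carrying such a flag; the HK_FLAG_IRQ_SPREAD name comes from the discussion
above, everything else is made up for the example:

#include <stdio.h>

#define NR_CPUS         8
#define HK_IRQ_SPREAD   0x1     /* hypothetical per-CPU housekeeping flag */

/* CPUs 0-3 are housekeeping, CPUs 4-7 are isolated. */
static const unsigned int hk_flags[NR_CPUS] = {
    HK_IRQ_SPREAD, HK_IRQ_SPREAD, HK_IRQ_SPREAD, HK_IRQ_SPREAD, 0, 0, 0, 0
};

/* Toy cpumask_local_spread(): return the i-th CPU that is allowed to
 * receive spread IRQs, wrapping around the housekeeping set. */
static int local_spread(unsigned int i)
{
    unsigned int nhk = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (hk_flags[cpu] & HK_IRQ_SPREAD)
            nhk++;
    if (!nhk)
        return 0;               /* no housekeeping CPUs: fall back to cpu0 */

    i %= nhk;                   /* wrap, like the real helper does */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(hk_flags[cpu] & HK_IRQ_SPREAD))
            continue;
        if (i-- == 0)
            return cpu;
    }
    return 0;
}

int main(void)
{
    unsigned int q;

    for (q = 0; q < 6; q++)
        printf("queue %u -> cpu %d\n", q, local_spread(q));
    return 0;
}

Queues 0-3 land on CPUs 0-3 and queues 4-5 wrap back onto CPUs 0-1; the
isolated CPUs 4-7 never receive a queue.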




Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >> The whole pile wants to be reverted. It's simply broken in several ways.
> >
> > I was asking for your comments on interaction with CPU hotplug :-)
> 
> Which I answered in a separate mail :)
> 
> > So housekeeping_cpumask has multiple meanings. In this case:
> 
> ...
> 
> > So as long as the meaning of the flags are respected, seems
> > alright.
> 
> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> when an affinity mask spans housekeeping and isolated is perfectly
> fine. It's well thought out and has no limitations.
> 
> > Nitesh, is there anything preventing this from being fixed
> > in userspace ? (as Thomas suggested previously).
> 
> Everything which is not managed can be steered by user space.

Yes, but it seems to be racy (that is, there is a window where the 
interrupt can be delivered to an isolated CPU).

ethtool ->
xgbe_set_channels ->
xgbe_full_restart_dev ->
xgbe_alloc_memory ->
xgbe_alloc_channels ->
cpumask_local_spread

Also ifconfig eth0 down / ifconfig eth0 up leads
to cpumask_spread_local.

How about adding a new flag for isolcpus instead?



Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-01 Thread Marcelo Tosatti
On Fri, Jan 29, 2021 at 07:41:27AM -0800, Alex Belits wrote:
> On 1/28/21 07:56, Thomas Gleixner wrote:
> > External Email
> > 
> > --
> > On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > > On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> > > > > > > /**
> > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > @@ -205,22 +206,27 @@ void __init 
> > > > > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > > > >  */
> > > > > > > unsigned int cpumask_local_spread(unsigned int i, int node)
> > > > > > > {
> > > > > > > - int cpu;
> > > > > > > + int cpu, hk_flags;
> > > > > > > + const struct cpumask *mask;
> > > > > > > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > > > > + mask = housekeeping_cpumask(hk_flags);
> > > > > > 
> > > > > > AFAICS, this generally resolves to something based on 
> > > > > > cpu_possible_mask
> > > > > > rather than cpu_online_mask as before, so could now potentially 
> > > > > > return an
> > > > > > offline CPU. Was that an intentional change?
> > > > > 
> > > > > Robin,
> > > > > 
> > > > > AFAICS online CPUs should be filtered.
> > > > 
> > > > Apologies if I'm being thick, but can you explain how? In the case of
> > > > isolation being disabled or compiled out, housekeeping_cpumask() is
> > > > literally just "return cpu_possible_mask;". If we then iterate over that
> > > > with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> > > > NUMA_NO_NODE case), what guarantees that CPU is actually online?
> > > > 
> > > > Robin.
> > > 
> > > Nothing, but that was the situation before 
> > > 1abdfe706a579a702799fce465bceb9fb01d407c
> > > as well.
> > > 
> > > cpumask_local_spread() should probably be disabling CPU hotplug.
> > 
> > It can't unless all callers are from preemptible code.
> > 
> > Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> > over the kernel is just wrong, really.
> > 
> > As I explained several times before there are very valid reasons for
> > having queues and interrupts on isolated CPUs. Just optimizing for the
> > usecases some people care about is not making anything better.
> 
> However making it mandatory for isolated CPUs to allow interrupts is not a
> good idea, either. Providing an environment free of disturbances is a valid
> goal, so we can't do something that will make it impossible to achieve. We
> know that both there is a great amount of demand for this feature and
> implementing it is doable, so cutting off the possibility of development in
> this direction would be bad.
> 
> Before there was housekeeping mask, I had to implement another, more
> cumbersome model that ended up being more intrusive than I wanted. That was
> one of the reasons why I have spent some time working on it in, please
> forgive me the pun, isolation.
> 
> I was relieved when housekeeping mask appeared, and I was able to remove a
> large chunk of code that distinguished between CPUs that "are there" and
> CPUs "available to run work". Housekeeping is supposed to define the set of
> CPUs that are intended to run work that is not specifically triggered by
> anything running on those CPUs. "CPUs that are there" are CPUs that are
> being maintained as a part of the system, so they are usable for running
> things on them.
> 
> My idea at the time was that we can separate this into two directions of
> development:
> 
> 1. Make sure that housekeeping mask applies to all kinds of work that
> appears on CPUs, so nothing random will end up running there. Because this
> is very much in line of what it does.

It's easier to specify "all members of the set" rather than having to specify each
individual member. Think of the set as a description of the types of
activities that should not have a given CPU as a target.

> 2. Rely on housekeeping mask to exclude anything not specifically intended
> to run on isolated CPUs, and concentrate efforts on making sure that things
> that are intended to [eventually] happen on those CPUs are handled properly
> -- in case of my recent proposals, delayed until synchronization even

[patch 3/3] nohz: tick_nohz_kick_task: only IPI if remote task is running

2021-01-28 Thread Marcelo Tosatti
If the task is not running, run_posix_cpu_timers has nothing
to elapse, so spare the IPI in that case.

Suggested-by: Peter Zijlstra 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/sched/core.c
===
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -9182,3 +9182,9 @@ void call_trace_sched_update_nr_running(
 {
 trace_sched_update_nr_running_tp(rq, count);
 }
+
+bool task_on_rq(struct task_struct *p)
+{
+   return p->on_rq == TASK_ON_RQ_QUEUED;
+}
+
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -232,6 +232,8 @@ extern void io_schedule_finish(int token
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+extern bool task_on_rq(struct task_struct *p);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void tick_nohz_kick_task(struct task_struct *tsk)
 {
-   int cpu = task_cpu(tsk);
-
/*
 * If the task concurrently migrates to another cpu,
 * we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct t
 *   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
 *  LOAD p->tick_dep_mask   LOAD p->cpu
 */
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task is not running, run_posix_cpu_timers
+* has nothing to elapse, so the IPI can be spared in
+* that case.
+*
+* activate_task()  STORE p->tick_dep_mask
+* STORE p->task_on_rq
+* __schedule() (switch to task 'p')smp_mb() (atomic_fetch_or())
+* LOCK rq->lockLOAD p->task_on_rq
+* smp_mb__after_spin_lock()
+* tick_nohz_task_switch()
+*  LOAD p->tick_dep_mask
+*/
+   if (!task_on_rq(tsk))
+   return;
 
preempt_disable();
if (cpu_online(cpu))




[patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-01-28 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wakeup 
the target CPUs of member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to 
elapse
  * per process timers.
  */
-void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
+void tick_nohz_dep_set_signal(struct task_struct *tsk,
+ enum tick_dep_bits bit)
 {
-   tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+   int prev;
+   struct signal_struct *sig = tsk->signal;
+
+   prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+   if (!prev) {
+   struct task_struct *t;
+
+   lockdep_assert_held(&tsk->sighand->siglock);
+   __for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
Index: linux-2.6/include/linux/tick.h
===
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -207,7 +207,7 @@ extern void tick_nohz_dep_set_task(struc
   enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
 enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
   enum tick_dep_bits bit);
@@ -252,11 +252,11 @@ static inline void tick_dep_clear_task(s
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
 }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit)
 {
if (tick_nohz_full_enabled())
-   tick_nohz_dep_set_signal(signal, bit);
+   tick_nohz_dep_set_signal(tsk, bit);
 }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit)
@@ -284,7 +284,7 @@ static inline void tick_dep_set_task(str
 enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_task(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit) { }
Index: linux-2.6/kernel/time/posix-cpu-timers.c
===
--- linux-2.6.orig/kernel/time/posix-cpu-timers.c
+++ linux-2.6/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *t
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
-   tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(p, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 /*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_s
if (*newval < *nextevt)
*nextevt = *newval;
 
-   tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(tsk, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,




[patch 1/3] nohz: only wakeup a single target cpu when kicking a task

2021-01-28 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
+*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
+*  LOAD p->tick_dep_mask   LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 0/3] nohz_full: only wakeup target CPUs when notifying new tick dependency (v5)

2021-01-28 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (eg DPDK). This patch changes the notification
to only IPI the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.

v5: actually replace superfluous rcu_read_lock with lockdep_assert
v4: only IPI if the remote task is on the remote runqueue (PeterZ/Frederic)
v3: replace superfluous rcu_read_lock with lockdep_assert (PeterZ)






[patch 0/3] nohz_full: only wakeup target CPUs when notifying new tick dependency (v4)

2021-01-28 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (eg DPDK). This patch changes the notification
to only IPI the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.

v4: only IPI if the remote task is on the remote runqueue (PeterZ/Frederic)
v3: replace superfluous rcu_read_lock with lockdep_assert (PeterZ)




[patch 1/3] nohz: only wakeup a single target cpu when kicking a task

2021-01-28 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
+*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
+*  LOAD p->tick_dep_mask   LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-01-28 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wakeup 
the target CPUs of member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -446,7 +446,17 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  */
 void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
 {
-   tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+   int prev;
+
+   prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+   if (!prev) {
+   struct task_struct *t;
+
+   rcu_read_lock();
+   __for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   rcu_read_unlock();
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)




[patch 3/3] nohz: tick_nohz_kick_task: only IPI if remote task is running

2021-01-28 Thread Marcelo Tosatti
If the task is not running, run_posix_cpu_timers has nothing
to elapse, so spare the IPI in that case.

Suggested-by: Peter Zijlstra 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/sched/core.c
===
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -9182,3 +9182,9 @@ void call_trace_sched_update_nr_running(
 {
 trace_sched_update_nr_running_tp(rq, count);
 }
+
+bool task_on_rq(struct task_struct *p)
+{
+   return p->on_rq == TASK_ON_RQ_QUEUED;
+}
+
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -232,6 +232,8 @@ extern void io_schedule_finish(int token
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+extern bool task_on_rq(struct task_struct *p);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void tick_nohz_kick_task(struct task_struct *tsk)
 {
-   int cpu = task_cpu(tsk);
-
/*
 * If the task concurrently migrates to another cpu,
 * we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct t
 *   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
 *  LOAD p->tick_dep_mask   LOAD p->cpu
 */
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task is not running, run_posix_cpu_timers
+* has nothing to elapse, so the IPI can be spared in
+* that case.
+*
+* activate_task()  STORE p->tick_dep_mask
+* STORE p->task_on_rq
+* __schedule() (switch to task 'p')smp_mb() (atomic_fetch_or())
+* LOCK rq->lockLOAD p->task_on_rq
+* smp_mb__after_spin_lock()
+* tick_nohz_task_switch()
+*  LOAD p->tick_dep_mask
+*/
+   if (!task_on_rq(tsk))
+   return;
 
preempt_disable();
if (cpu_online(cpu))




Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-28 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 04:56:07PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> >> > > >/**
> >> > > > * cpumask_next - get the next cpu in a cpumask
> >> > > > @@ -205,22 +206,27 @@ void __init 
> >> > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> >> > > > */
> >> > > >unsigned int cpumask_local_spread(unsigned int i, int node)
> >> > > >{
> >> > > > -int cpu;
> >> > > > +int cpu, hk_flags;
> >> > > > +const struct cpumask *mask;
> >> > > > +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > > > +mask = housekeeping_cpumask(hk_flags);
> >> > > 
> >> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> >> > > rather than cpu_online_mask as before, so could now potentially return 
> >> > > an
> >> > > offline CPU. Was that an intentional change?
> >> > 
> >> > Robin,
> >> > 
> >> > AFAICS online CPUs should be filtered.
> >> 
> >> Apologies if I'm being thick, but can you explain how? In the case of
> >> isolation being disabled or compiled out, housekeeping_cpumask() is
> >> literally just "return cpu_possible_mask;". If we then iterate over that
> >> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> >> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> >> 
> >> Robin.
> >
> > Nothing, but that was the situation before 
> > 1abdfe706a579a702799fce465bceb9fb01d407c
> > as well.
> >
> > cpumask_local_spread() should probably be disabling CPU hotplug.
> 
> It can't unless all callers are from preemptible code.
> 
> Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> over the kernel is just wrong, really.
> 
> As I explained several times before there are very valid reasons for
> having queues and interrupts on isolated CPUs. Just optimizing for the
> usecases some people care about is not making anything better.

And that is right.



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-28 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> >> > +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > +mask = housekeeping_cpumask(hk_flags);
> >> 
> >> AFAICS, this generally resolves to something based on cpu_possible_mask
> >> rather than cpu_online_mask as before, so could now potentially return an
> >> offline CPU. Was that an intentional change?
> >
> > Robin,
> >
> > AFAICS online CPUs should be filtered.
> 
> The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)
Anyway...

So housekeeping_cpumask has multiple meanings. In this case:

HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ

 domain
   Isolate from the general SMP balancing and scheduling
   algorithms. Note that performing domain isolation this way
   is irreversible: it's not possible to bring back a CPU to
   the domains once isolated through isolcpus. It's strongly
   advised to use cpusets instead to disable scheduler load
   balancing through the "cpuset.sched_load_balance" file.
   It offers a much more flexible interface where CPUs can
   move in and out of an isolated set anytime.

   You can move a process onto or off an "isolated" CPU via
   the CPU affinity syscalls or cpuset.
   <cpu number> begins at 0 and the maximum value is
   "number of CPUs in system - 1".

 managed_irq

   Isolate from being targeted by managed interrupts
   which have an interrupt mask containing isolated
   CPUs. The affinity of managed interrupts is
   handled by the kernel and cannot be changed via
   the /proc/irq/* interfaces.

   This isolation is best effort and only effective
   if the automatically assigned interrupt mask of a
   device queue contains isolated and housekeeping
   CPUs. If housekeeping CPUs are online then such
   interrupts are directed to the housekeeping CPU
   so that IO submitted on the housekeeping CPU
   cannot disturb the isolated CPU.

   If a queue's affinity mask contains only isolated
   CPUs then this parameter has no effect on the
   interrupt routing decision, though interrupts are
   only delivered when tasks running on those
   isolated CPUs submit IO. IO submitted on
   housekeeping CPUs has no influence on those
   queues.

So as long as the meanings of the flags are respected, this seems
alright.
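
For illustration, something like the following sketch (not the patch under
discussion; it simply intersects the housekeeping mask with the online mask)
would respect both flags while also addressing Robin's concern about
offline CPUs:

#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/*
 * Sketch only: pick an online housekeeping CPU, falling back to any
 * online CPU if the intersection happens to be empty.
 */
static int pick_housekeeping_cpu(void)
{
	const struct cpumask *hk;
	int cpu;

	hk = housekeeping_cpumask(HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ);
	cpu = cpumask_first_and(hk, cpu_online_mask);
	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(cpu_online_mask);
	return cpu;
}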

Nitesh, is there anything preventing this from being fixed
in userspace, as Thomas suggested previously?





Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-27 Thread Marcelo Tosatti
On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> On 2021-01-27 12:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> > > Hi,
> > > 
> > > On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > > > From: Alex Belits 
> > > > 
> > > > The current implementation of cpumask_local_spread() does not respect 
> > > > the
> > > > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > > > it will return it to the caller for pinning of its IRQ threads. Having
> > > > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > > > overhead.
> > > > 
> > > > Restrict the CPUs that are returned for spreading IRQs only to the
> > > > available housekeeping CPUs.
> > > > 
> > > > Signed-off-by: Alex Belits 
> > > > Signed-off-by: Nitesh Narayan Lal 
> > > > ---
> > > >lib/cpumask.c | 16 +++-
> > > >1 file changed, 11 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > > index fb22fb266f93..85da6ab4fbb5 100644
> > > > --- a/lib/cpumask.c
> > > > +++ b/lib/cpumask.c
> > > > @@ -6,6 +6,7 @@
> > > >#include 
> > > >#include 
> > > >#include 
> > > > +#include 
> > > >/**
> > > > * cpumask_next - get the next cpu in a cpumask
> > > > @@ -205,22 +206,27 @@ void __init 
> > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > */
> > > >unsigned int cpumask_local_spread(unsigned int i, int node)
> > > >{
> > > > -   int cpu;
> > > > +   int cpu, hk_flags;
> > > > +   const struct cpumask *mask;
> > > > +   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > +   mask = housekeeping_cpumask(hk_flags);
> > > 
> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> > > rather than cpu_online_mask as before, so could now potentially return an
> > > offline CPU. Was that an intentional change?
> > 
> > Robin,
> > 
> > AFAICS online CPUs should be filtered.
> 
> Apologies if I'm being thick, but can you explain how? In the case of
> isolation being disabled or compiled out, housekeeping_cpumask() is
> literally just "return cpu_possible_mask;". If we then iterate over that
> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> 
> Robin.

Nothing, but that was the situation before 
1abdfe706a579a702799fce465bceb9fb01d407c
as well.

cpumask_local_spread() should probably disable CPU hotplug.

Thomas?

> 
> > > I was just looking at the current code since I had the rare presence of 
> > > mind
> > > to check if something suitable already existed before I start open-coding
> > > "any online CPU, but local node preferred" logic for handling IRQ affinity
> > > in a driver - cpumask_local_spread() appears to be almost what I want (if 
> > > a
> > > bit more heavyweight), if only it would actually guarantee an online CPU 
> > > as
> > > the kerneldoc claims :(
> > > 
> > > Robin.
> > > 
> > > > /* Wrap: we always want a cpu. */
> > > > -   i %= num_online_cpus();
> > > > +   i %= cpumask_weight(mask);
> > > > if (node == NUMA_NO_NODE) {
> > > > -   for_each_cpu(cpu, cpu_online_mask)
> > > > +   for_each_cpu(cpu, mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > +   }
> > > > } else {
> > > > /* NUMA first. */
> > > > -   for_each_cpu_and(cpu, cpumask_of_node(node), 
> > > > cpu_online_mask)
> > > > +   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > +   }
> > > > -   for_each_cpu(cpu, cpu_online_mask) {
> > > > +   for_each_cpu(cpu, mask) {
> > > > /* Skip NUMA nodes, done above. */
> > > > if (cpumask_test_cpu(cpu, 
> > > > cpumask_of_node(node)))
> > > > continue;
> > > > 
> > 



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-27 Thread Marcelo Tosatti
On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> Hi,
> 
> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > From: Alex Belits 
> > 
> > The current implementation of cpumask_local_spread() does not respect the
> > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > it will return it to the caller for pinning of its IRQ threads. Having
> > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > overhead.
> > 
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> > 
> > Signed-off-by: Alex Belits 
> > Signed-off-by: Nitesh Narayan Lal 
> > ---
> >   lib/cpumask.c | 16 +++-
> >   1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index fb22fb266f93..85da6ab4fbb5 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -6,6 +6,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   /**
> >* cpumask_next - get the next cpu in a cpumask
> > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
> > mask)
> >*/
> >   unsigned int cpumask_local_spread(unsigned int i, int node)
> >   {
> > -   int cpu;
> > +   int cpu, hk_flags;
> > +   const struct cpumask *mask;
> > +   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > +   mask = housekeeping_cpumask(hk_flags);
> 
> AFAICS, this generally resolves to something based on cpu_possible_mask
> rather than cpu_online_mask as before, so could now potentially return an
> offline CPU. Was that an intentional change?

Robin,

AFAICS online CPUs should be filtered.

> I was just looking at the current code since I had the rare presence of mind
> to check if something suitable already existed before I start open-coding
> "any online CPU, but local node preferred" logic for handling IRQ affinity
> in a driver - cpumask_local_spread() appears to be almost what I want (if a
> bit more heavyweight), if only it would actually guarantee an online CPU as
> the kerneldoc claims :(
> 
> Robin.
> 
> > /* Wrap: we always want a cpu. */
> > -   i %= num_online_cpus();
> > +   i %= cpumask_weight(mask);
> > if (node == NUMA_NO_NODE) {
> > -   for_each_cpu(cpu, cpu_online_mask)
> > +   for_each_cpu(cpu, mask) {
> > if (i-- == 0)
> > return cpu;
> > +   }
> > } else {
> > /* NUMA first. */
> > -   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > +   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > if (i-- == 0)
> > return cpu;
> > +   }
> > -   for_each_cpu(cpu, cpu_online_mask) {
> > +   for_each_cpu(cpu, mask) {
> > /* Skip NUMA nodes, done above. */
> > if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> > continue;
> > 



Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2021-01-22 Thread Marcelo Tosatti
On Tue, Nov 24, 2020 at 12:21:06AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +, Alex Belits wrote:
> > 
> > On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> > > External Email
> > > 
> > > ---
> > > ---
> > > On Mon, Nov 23, 2020 at 05:58:42PM +, Alex Belits wrote:
> > > > From: Yuri Norov 
> > > > 
> > > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > > running
> > > > isolated tasks.
> > > > 
> > > > Signed-off-by: Yuri Norov 
> > > > [abel...@marvell.com: use safe task_isolation_cpumask()
> > > > implementation]
> > > > Signed-off-by: Alex Belits 
> > > > ---
> > > >  kernel/smp.c | 14 +-
> > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > > index 4d17501433be..b2faecf58ed0 100644
> > > > --- a/kernel/smp.c
> > > > +++ b/kernel/smp.c
> > > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> > > >   */
> > > >  void kick_all_cpus_sync(void)
> > > >  {
> > > > +   struct cpumask mask;
> > > > +
> > > > /* Make sure the change is visible before we kick the cpus */
> > > > smp_mb();
> > > > -   smp_call_function(do_nothing, NULL, 1);
> > > > +
> > > > +   preempt_disable();
> > > > +#ifdef CONFIG_TASK_ISOLATION
> > > > +   cpumask_clear(&mask);
> > > > +   task_isolation_cpumask(&mask);
> > > > +   cpumask_complement(&mask, &mask);
> > > > +#else
> > > > +   cpumask_setall(&mask);
> > > > +#endif
> > > > +   smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > > +   preempt_enable();
> > > 
> > > Same comment about IPIs here.
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and run
> > kernel code with no additional preparation. Then the only solution is
> > to always do full broadcast and require all CPUs to process it.
> > 
> > What I am trying to introduce is the idea of CPU that is not likely to
> > run kernel code any soon, and can afford to go through an additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.

Perhaps a bitmask of pending flushes makes more sense? The
static_key_enable() IPI is one of the users, but for that case it would
be necessary to differentiate between in-kernel mode and out-of-kernel
mode atomically (since an i-cache flush must be performed if the
isolated CPU is in kernel mode).
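
Roughly something like this (a sketch only, not code from any posted
patchset; the flush types and names are invented for illustration):

/* Per-CPU bitmask of flushes deferred until the next kernel entry. */
enum {
	PENDING_ICACHE_FLUSH,
	PENDING_TLB_FLUSH,
};

static DEFINE_PER_CPU(unsigned long, pending_flushes);

/* Caller side: record the flush instead of IPI'ing an isolated CPU. */
static void defer_flush(int cpu, int what)
{
	set_bit(what, per_cpu_ptr(&pending_flushes, cpu));
}

/* Isolated CPU, early in kernel entry: fetch and clear the deferred work. */
static unsigned long fetch_deferred_flushes(void)
{
	return xchg(this_cpu_ptr(&pending_flushes), 0);
}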

> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>  CPU 0CPU 1
> --   -
> // kernel entry
> data_to_sync = 1ll_isol_flags = ISOLATED_BROKEN
> smp_mb()smp_mb()
> if ll_isol_flags(CPU 1) == ISOLATED READ data_to_sync
>  smp_call(CPU 1)

Since isolated mode with syscalls is a desired feature, a separate
atomic with in_kernel_mode = 0/1 (set/cleared on kernel entry/exit
while TIF_TASK_ISOLATION is set) would be necessary, with race-free
logic similar to the above.
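
A minimal sketch of that, with invented names (again not code from the
patchset), could look like:

static DEFINE_PER_CPU(atomic_t, isol_in_kernel_mode);

/* Kernel entry on the isolated CPU (while TIF_TASK_ISOLATION is set). */
static inline void isol_kernel_enter(void)
{
	atomic_set(this_cpu_ptr(&isol_in_kernel_mode), 1);
	smp_mb();	/* pairs with smp_mb() in isol_needs_ipi() */
	/* process deferred flushes here, before running patched kernel code */
}

/* Kernel exit, back to isolated userspace. */
static inline void isol_kernel_exit(void)
{
	smp_mb();
	atomic_set(this_cpu_ptr(&isol_in_kernel_mode), 0);
}

/* Remote CPU doing e.g. static_key_enable(): the i-cache flush IPI can
 * only be skipped while @cpu runs isolated userspace code. */
static inline bool isol_needs_ipi(int cpu)
{
	smp_mb();	/* order the code patching before the flag load */
	return atomic_read(per_cpu_ptr(&isol_in_kernel_mode, cpu));
}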

> You should document that, ie: explain why what you're doing is safe.
> 
> Also Beware though that the data to sync in question doesn't need to be 
> visible
> in the entry code before task_isolation_kernel_enter(). You need to audit all
> the callers of kick_all_cpus_sync().

Cscope tag: flush_icache_range
   #   line  filename / context / line
   1 96  arch/arc/kernel/jump_label.c <>
 flush_icache_range(entry->code, entry->code + JUMP_LABEL_NOP_SIZE);

This case would be OK for delayed processing before kernel entry, as long as
no code before task_isolation_kernel_enter can be modified (which I am
not sure about).

But:

  36 28  arch/ia64/include/asm/cacheflush.h <>
 flush_icache_range(_addr, _addr + (len)); \

Is less certain.

Alex, do you recall whether arch_jump_label_transform was the only offender,
or whether there were others as well? (I suppose handling only the ones which
matter in production at the moment, and fixing individual ones later, makes
the most sense.)





Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks

2021-01-22 Thread Marcelo Tosatti
On Thu, Oct 01, 2020 at 04:47:31PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:58:24PM +, Alex Belits wrote:
> > From: Yuri Norov 
> > 

> > so we don't need to flush it.
> 
> What guarantees that we have no backlog on it?

From Paolo's work to use lockless reading of
per-CPU skb lists

https://www.spinics.net/lists/netdev/msg682693.html

It also exposed skb queue length to userspace

https://www.spinics.net/lists/netdev/msg684939.html

But if I remember correctly, waiting for an RCU grace
period was also necessary to ensure there is no backlog!?

Paolo, would you please remind us what the sequence of steps was?
(And also, for the userspace isolation interface, where
the application informs the kernel that it's entering isolated
mode, is just confirming that the queues have zero length
sufficient?)

TIA!

> 
> > Currently flush_all_backlogs()
> > enqueues corresponding work on all CPUs including ones that run
> > isolated tasks. It leads to breaking task isolation for nothing.
> > 
> > In this patch, backlog flushing is enqueued only on non-isolated CPUs.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: use safe task_isolation_on_cpu() implementation]
> > Signed-off-by: Alex Belits 
> > ---
> >  net/core/dev.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 90b59fc50dc9..83a282f7453d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -74,6 +74,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
> >  
> > get_online_cpus();
> >  
> > -   for_each_online_cpu(cpu)
> > +   smp_rmb();
> 
> What is it ordering?
> 
> > +   for_each_online_cpu(cpu) {
> > +   if (task_isolation_on_cpu(cpu))
> > +   continue;
> > queue_work_on(cpu, system_highpri_wq,
> >   per_cpu_ptr(&flush_works, cpu));
> > +   }
> >  
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(&flush_works, cpu));
> 
> Thanks.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-15 Thread Marcelo Tosatti
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
> On 11/12/20 22:04, Thomas Gleixner wrote:
> > > Its 100ms off with migration, and can be reduced further (customers
> > > complained about 5 seconds but seem happy with 0.1ms).
> > What is 100ms? Guaranteed maximum migration time?
> 
> I suppose it's the length between the time from KVM_GET_CLOCK and
> KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC).  But the
> VM is paused for much longer, the sequence for the non-live part of the
> migration (aka brownout) is as follows:
> 
> pause
> finish sending RAMreceive RAM   ~1 sec
> send paused-VM state  finish receiving RAM \
>   receive paused-VM state   ) 0.1 sec
>   restart  /
> 
> The nanosecond and TSC times are sent as part of the paused-VM state at the
> very end of the live migration process.
> 
> So it's still true that the time advances during live migration brownout;
> 0.1 seconds is just the final part of the live migration process.  But for
> _live_ migration there is no need to design things according to "people are
> happy if their clock is off by 0.1 seconds only".  

Agree. What would be a good way to fix this? 

It seems to me that using CLOCK_REALTIME, as in the interface Maxim is
proposing, is prone to differences between the source and destination
hosts' CLOCK_REALTIME itself.

Perhaps there is another way to measure that 0.1 sec which is
independent of the clock values of the source and destination hosts
(say, by sending a packet once the clock stops counting).

Then, on the destination, measure delta = clock_restart_time -
packet_receival_time and advance the clock by that amount.
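
Roughly, on the destination side (a sketch with invented names, just to
make the arithmetic explicit; it only depends on the destination clock):

#include <stdint.h>
#include <time.h>

static uint64_t now_realtime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/*
 * t_pkt_ns: destination CLOCK_REALTIME when the "source clock stopped"
 * packet arrived. Called right before the guest clock is restarted;
 * the guest clock is then advanced by the returned amount.
 */
static uint64_t paused_ns(uint64_t t_pkt_ns)
{
	return now_realtime_ns() - t_pkt_ns;
}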



> Again, save-to-disk,
> reverse debugging and the like are a different story, which is why KVM
> should delegate policy to userspace (while documenting how to do it right).
> 
> Paolo
> 
> > CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
> > this state persists up to the point where NTP corrects it with a time
> > jump.
> > 
> > So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms
> > it's off by 5 seconds.
> > 
> > CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
> > 



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-11 Thread Marcelo Tosatti
On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
> > On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> >> You really all live in a seperate universe creating your own rules how
> >> things which other people work hard on to get it correct can be screwed
> >> over.
> >
> > 1. T = read timestamp.
> > 2. migrate (VM stops for a certain period).
> > 3. use timestamp T.
> 
> This is exactly the problem. Time stops at pause and continues where it
> stopped on resume.
> 
> But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point
> where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME
> and CLOCK_TAI are off by tpause.
> 
> Now the application gets a packet from the outside world with a
> CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads
> from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to
> do with that? Make stupid assumptions that the other end screwed up
> timekeeping, throw an error that the system it is running on screwed up
> timekeeping? And a second later when NTP catched up it gets the next
> surprise because the systems CLOCK_REALTIME jumped forward unexpectedly
> or if there is no NTP it's confused forever.

This can happen even with a "perfect" solution that syncs time
instantly on the migration destination. See steps 1, 2, 3.

Unless you notify applications to invalidate their time reads,
I can't see a way to fix this.

Therefore if you use VM migration in the first place, a certain amount of
timestamp accuracy error must be tolerated.

> How can you even assume that this is correct?

As noted above, even without a window of unsynchronized time (due to
delay for NTP to sync time), time reads can be stale.

> It is exactly the same problem as we had many years ago with hardware
> clocks suddenly stopping to tick which caused quite some stuff to go
> belly up.

Customers complained when it was 5 seconds off; now it's 0.1ms (and
people seem happy).

> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
> (with a certain degree of accuracy) to compensate for the sleep time, so
> the other end of a communication is at least in the same ballpark, but
> not 50 seconds off.

It's 100ms off with migration, and that can be reduced further (customers
complained about 5 seconds but seem happy with 0.1ms).

> >> This features first, correctness later frenzy is insane and it better
> >> stops now before you pile even more crap on the existing steaming pile
> >> of insanities.
> >
> > Sure.
> 
> I wish that would be true. OS people - you should know that - are
> fighting forever with hardware people over feature madness and the
> attitude of 'we can fix that in software' which turns often enough out
> to be wrong.
> 
> Now sadly enough people who suffered from that madness work on
> virtualization and instead of trying to avoid the same problem they go
> off and make it even worse.

So you think it's important to reduce the 100ms offset?

> It's the same problem again as with hardware people. Not talking to the
> other people _before_ making uninformed assumptions and decisions.
> 
> We did it that way because big customer asked for it is not a
> justification for inflicting this on everybody else and thereby
> violating correctness. Works for me and my big customer is not a proof
> of correctness either.
> 
> It's another proof that this industry just "works" by chance.
> 
> Thanks,
> 
> tglx

OK, makes sense; then reducing the 0.1ms window even further
is a useful thing to do. What would be an acceptable
CLOCK_REALTIME accuracy error on migration?





Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-10 Thread Marcelo Tosatti
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
> > On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> >> Marcelo,
> >> 
> >> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> >> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> >> 
> >> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> >> that the first time?
> >> >
> >> > Because 
> >> >
> >> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> >> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> >> 
> >> How is that supposed to work w/o the guest kernels help if you have to
> >> keep clock realtime up to date? 
> >
> > Upon VM resume, we notify NTP daemon in the guest to sync realtime
> > clock.
> 
> Brilliant. What happens if there is no NTP daemon? What happens if the
> NTP daemon is not part of the virt orchestration magic and cannot be
> notified, then it will notice the time jump after the next update
> interval.
> 
> What about correctness?
> 
> ALL CLOCK_* stop and resume when the VM is resumed at the point where
> they stopped.
> 
> So up to the point where NTP catches up and corrects clock realtime and
> TAI other processes can observe that time jumped in the outside world,
> e.g. via a network packet or whatever, but there is no reason why time
> should have jumped outside vs. the local one.
> 
> You really all live in a seperate universe creating your own rules how
> things which other people work hard on to get it correct can be screwed
> over.

1. T = read timestamp.
2. migrate (VM stops for a certain period).
3. use timestamp T.

> Of course this all is nowhere documented in detail. At least a quick
> search with about 10 different keyword combinations revealed absolutely
> nothing.
> 
> This features first, correctness later frenzy is insane and it better
> stops now before you pile even more crap on the existing steaming pile
> of insanities.

Sure.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-10 Thread Marcelo Tosatti
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> 
> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> that the first time?
> >
> > Because 
> >
> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> 
> How is that supposed to work w/o the guest kernels help if you have to
> keep clock realtime up to date? 

Upon VM resume, we notify the NTP daemon in the guest to sync the
realtime clock.
> 
> > 2) The solution to inject NMIs to the guest seemed overly
> > complicated.
> 
> Why do you need NMIs?
> 
> All you need is a way to communicate to the guest that it should prepare
> for clock madness to happen. Whether that's an IPI or a bit in a
> hyperpage which gets checked during the update of the guest timekeeping
> does not matter at all.
> 
> But you certainly do not need an NMI because there is nothing useful you
> can do within an NMI.
> 
> Thanks,
> 
> tglx



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-09 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> >> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> >> >> > value
> >> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> >> >> > vCPU.
> >> >> > +
> >> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> >> > +KVM will adjust the guest TSC value by the time that passed since 
> >> >> > the moment
> >> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> >> 
> >> >> This introduces the wraparound bug in Linux timekeeping, doesnt it?
> >> 
> >> Which bug?
> >
> > max_cycles overflow. Sent a message to Maxim describing it.
> 
> Truly helpful. Why the hell did you not talk to me when you ran into
> that the first time?

Because 

1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
is paused (so we wanted to stop guest clock when VM is paused anyway).

2) The solution to inject NMIs to the guest seemed overly
complicated.

> >> For one I have no idea which bug you are talking about and if the bug is
> >> caused by the VMM then why would you "fix" it in the guest kernel.
> >
> > 1) Stop guest, save TSC value of cpu-0 = V.
> > 2) Wait for some amount of time = W.
> > 3) Start guest, load TSC value with V+W.
> >
> > Can cause an overflow on Linux timekeeping.
> 
> Yes, because you violate the basic assumption which Linux timekeeping
> makes. See the other mail in this thread.
> 
> Thanks,
> 
> tglx



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 06:25:13PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
> > On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> > > > > value
> > > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> > > > > vCPU.
> > > > > +
> > > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > > +KVM will adjust the guest TSC value by the time that passed since 
> > > > > the moment
> > > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > > 
> > > > This introduces the wraparound bug in Linux timekeeping, doesnt it?
> > 
> > Which bug?
> > 
> > > It does.
> > > Could you prepare a reproducer for this bug so I get a better idea about
> > > what are you talking about?
> > > 
> > > I assume you need very long (like days worth) jump to trigger this bug
> > > and for such case we can either work around it in qemu / kernel 
> > > or fix it in the guest kernel and I strongly prefer the latter.
> > > 
> > > Thomas, what do you think about it?
> > 
> > For one I have no idea which bug you are talking about and if the bug is
> > caused by the VMM then why would you "fix" it in the guest kernel.
> 
> The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
> forward by large enough value in one go, 
> then the guest kernel will supposingly have an overflow in the time code.
> I don't consider this to be a buggy VMM behavior, but rather a kernel
> bug that should be fixed (if this bug actually exists)

It exists.

> Purely in theory this can even happen on real hardware if for example SMM 
> handler
> blocks a CPU from running for a long duration, or hardware debugging
> interface does, or some other hardware transparent sleep mechanism kicks in
> and blocks a CPU from running.
> (We do handle this gracefully for S3/S4)
> 
> > 
> > Aside of that I think I made it pretty clear what the right thing to do
> > is.
> 
> This is orthogonal to this issue of the 'bug'. 
> Here we are not talking about per-vcpu TSC offsets, something that I said 
> that I do agree with you that it would be very nice to get rid of.
>  
> We are talking about the fact that TSC can jump forward by arbitrary large
> value if the migration took arbitrary amount of time, which 
> (assuming that the bug is real) can crash the guest kernel.

QE reproduced it.

> This will happen even if we use per VM global tsc offset.
> 
> So what do you think?
> 
> Best regards,
>   Maxim Levitsky
> 
> > 
> > Thanks,
> > 
> > tglx
> > 
> 



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> >> > value
> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> >> > vCPU.
> >> > +
> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> > +KVM will adjust the guest TSC value by the time that passed since the 
> >> > moment
> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> 
> >> This introduces the wraparound bug in Linux timekeeping, doesnt it?
> 
> Which bug?

max_cycles overflow. Sent a message to Maxim describing it.

> 
> > It does.
> > Could you prepare a reproducer for this bug so I get a better idea about
> > what are you talking about?
> >
> > I assume you need very long (like days worth) jump to trigger this bug
> > and for such case we can either work around it in qemu / kernel 
> > or fix it in the guest kernel and I strongly prefer the latter.
> >
> > Thomas, what do you think about it?
> 
> For one I have no idea which bug you are talking about and if the bug is
> caused by the VMM then why would you "fix" it in the guest kernel.

1) Stop guest, save TSC value of cpu-0 = V.
2) Wait for some amount of time = W.
3) Start guest, load TSC value with V+W.

This can cause an overflow in Linux timekeeping.
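
For a rough feel for the numbers (the mult/shift values below are
assumptions for illustration, not taken from the report; the kernel's
actual limit, max_cycles, also subtracts safety margins, so it is
stricter than this):

#include <stdio.h>

int main(void)
{
	/* assumed: ~3 GHz TSC, clocksource mult/shift of about 5592405/24 */
	unsigned long long mult = 5592405ULL;
	unsigned long long tsc_hz = 3000000000ULL;

	/*
	 * timekeeping converts (delta * mult) >> shift in 64 bits, so the
	 * multiplication wraps once delta exceeds about ULLONG_MAX / mult.
	 */
	unsigned long long max_delta = ~0ULL / mult;

	printf("overflow after ~%llu cycles (~%llu seconds of forward jump)\n",
	       max_delta, max_delta / tsc_hz);
	return 0;
}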

> Aside of that I think I made it pretty clear what the right thing to do
> is.

Sure: the notion of a "unique TSC offset" already exists (it is detected
by the TSC write logic, though it is not explicit in the interface).

But AFAIK it works pretty well.

Exposing a single TSC value at the interface level seems alright to
me...



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 04:50:53PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> > > These two new ioctls allow to more precisly capture and
> > > restore guest's TSC state.
> > > 
> > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > even when there is a significant downtime during the migration.
> > > 
> > > Suggested-by: Paolo Bonzini 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  Documentation/virt/kvm/api.rst | 65 ++
> > >  arch/x86/kvm/x86.c | 73 ++
> > >  include/uapi/linux/kvm.h   | 15 +++
> > >  3 files changed, 153 insertions(+)
> > > 
> > > diff --git a/Documentation/virt/kvm/api.rst 
> > > b/Documentation/virt/kvm/api.rst
> > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is 
> > > invoked, the vCPU may
> > >  experience inconsistent filtering behavior on MSR accesses.
> > >  
> > >  
> > > +4.127 KVM_GET_TSC_STATE
> > > +
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +  struct kvm_tsc_state {
> > > + __u32 flags;
> > > + __u64 nsec;
> > > + __u64 tsc;
> > > + __u64 tsc_adjust;
> > > +  };
> > > +
> > > +flags values for ``struct kvm_tsc_state``:
> > > +
> > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > +
> > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > +Always set by KVM_GET_TSC_STATE, might be omitted in 
> > > KVM_SET_TSC_STATE
> > > +
> > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > +
> > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > +
> > > +
> > > +This ioctl allows the user space to read the guest's 
> > > IA32_TSC,IA32_TSC_ADJUST,
> > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds 
> > > since unix
> > > +epoch.
> > 
> > Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
> > a time base, but for TSC it should not be necessary.
> 
> 
> CLOCK_REALTIME is used as an absolute time reference that should match
> on both computers. I could have used CLOCK_TAI instead for example.
> 
> The reference allows to account for time passed between saving and restoring
> the TSC as explained above.

As mentioned, we don't want this due to the overflow.

Again, I think the higher priority is to allow enabling invariant TSC
by default (to disable kvmclock).

> > > +
> > > +
> > > +4.128 KVM_SET_TSC_STATE
> > > +
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> > > value
> > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> > > vCPU.
> > > +
> > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > +KVM will adjust the guest TSC value by the time that passed since the 
> > > moment
> > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > 
> > This introduces the wraparound bug in Linux timekeeping, doesnt it?
> 
> It does.
> Could you prepare a reproducer for this bug so I get a better idea about
> what are you talking about?

Enable CONFIG_DEBUG_TIMEKEEPING, check what max_cycles is from the TSC
clocksource:

#ifdef CONFIG_DEBUG_TIMEKEEPING
#define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */

static void timekeeping_check_update(struct timekeeper *tk, u64 offset)
{

u64 max_cycles = tk->tkr_mono.clock->max_c

Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
> 
> > On Dec 7, 2020, at 9:00 AM, Maxim Levitsky  wrote:
> > 
> > On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
>  On Dec 7, 2020, at 8:38 AM, Thomas Gleixner  wrote:
> >>> 
> >>> On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> > On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> > From a timekeeping POV and the guests expectation of TSC this is
> > fundamentally wrong:
> > 
> > tscguest = scaled(hosttsc) + offset
> > 
> > The TSC has to be viewed systemwide and not per CPU. It's systemwide
> > used for timekeeping and for that to work it has to be synchronized. 
> > 
> > Why would this be different on virt? Just because it's virt or what? 
> > 
> > Migration is a guest wide thing and you're not migrating single vCPUs.
> > 
> > This hackery just papers over he underlying design fail that KVM looks
> > at the TSC per vCPU which is the root cause and that needs to be fixed.
>  
>  I don't disagree with you.
>  As far as I know the main reasons that kvm tracks TSC per guest are
>  
>  1. cases when host tsc is not stable 
>  (hopefully rare now, and I don't mind making
>  the new API just refuse to work when this is detected, and revert to old 
>  way
>  of doing things).
> >>> 
> >>> That's a trainwreck to begin with and I really would just not support it
> >>> for anything new which aims to be more precise and correct.  TSC has
> >>> become pretty reliable over the years.
> >>> 
>  2. (theoretical) ability of the guest to introduce per core tsc offfset
>  by either using TSC_ADJUST (for which I got recently an idea to stop
>  advertising this feature to the guest), or writing TSC directly which
>  is allowed by Intel's PRM:
> >>> 
> >>> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> >>> which means you get the precise offset.
> >>> 
> >>> The general principle still applies from a system POV.
> >>> 
> >>>TSC base (systemwide view) - The sane case
> >>> 
> >>>TSC CPU  = TSC base + TSC_ADJUST
> >>> 
> >>> The guest TSC base is a per guest constant offset to the host TSC.
> >>> 
> >>>TSC guest base = TSC host base + guest base offset
> >>> 
> >>> If the guest want's this different per vCPU by writing to the MSR or to
> >>> TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> >>> is the offset to the TSC base of the guest.
> >> 
> >> How about, if the guest wants to write TSC_ADJUST, it can turn off all 
> >> paravirt features and keep both pieces?
> >> 
> > 
> > This is one of the things I had in mind recently.
> > 
> > Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
> > and forbid it from writing it at all.
> 
> Seems reasonable to me.
> 
> It also seems okay for some MSRs to stop working after the guest enabled new 
> PV timekeeping.
> 
> I do have a feature request, though: IMO it would be quite nifty if the new 
> kvmclock structure could also expose NTP corrections. In other words, if you 
> could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, 
> and CLOCK_REALTIME, then we could have paravirt NTP.

Hi Andy,

Any reason why drivers/ptp/ptp_kvm.c does not work for you?
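
(For reference, and assuming the usual setup where the guest's KVM PTP
clock shows up as /dev/ptp0: chrony can consume it directly as a
reference clock via its "refclock PHC" directive, which gives the guest
the host's clock without going through NTP packets.)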

> Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds 
> in a race free way :). But I suppose that just exposing TAI and letting the 
> guest deal with the TAI - UTC offset itself would get the job done just fine.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> These two new ioctls allow to more precisly capture and
> restore guest's TSC state.
> 
> Both ioctls are meant to be used to accurately migrate guest TSC
> even when there is a significant downtime during the migration.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Maxim Levitsky 
> ---
>  Documentation/virt/kvm/api.rst | 65 ++
>  arch/x86/kvm/x86.c | 73 ++
>  include/uapi/linux/kvm.h   | 15 +++
>  3 files changed, 153 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 70254eaa5229f..ebecfe4b414ce 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is 
> invoked, the vCPU may
>  experience inconsistent filtering behavior on MSR accesses.
>  
>  
> +4.127 KVM_GET_TSC_STATE
> +
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +  struct kvm_tsc_state {
> + __u32 flags;
> + __u64 nsec;
> + __u64 tsc;
> + __u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_state``:
> +
> +``KVM_TSC_STATE_TIMESTAMP_VALID``
> +
> +  ``nsec`` contains nanoseconds from unix epoch.
> +Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> +
> +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> +
> +
> +This ioctl allows the user space to read the guest's 
> IA32_TSC,IA32_TSC_ADJUST,
> +and the current value of host's CLOCK_REALTIME clock in nanoseconds since 
> unix
> +epoch.

Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
a time base, but for TSC it should not be necessary.

> +
> +
> +4.128 KVM_SET_TSC_STATE
> +
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> +
> +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> +KVM will adjust the guest TSC value by the time that passed since the moment
> +CLOCK_REALTIME timestamp was saved in the struct and current value of
> +CLOCK_REALTIME, and set the guest's TSC to the new value.

This introduces the wraparound bug in Linux timekeeping, doesnt it?

> +
> +Otherwise KVM will set the guest TSC value to the exact value as given
> +in the struct.
> +
> +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports 
> IA32_MSR_TSC_ADJUST,
> +then its value will be set to the given value from the struct.
> +
> +It is assumed that either both ioctls will be run on the same machine,
> +or that source and destination machines have synchronized clocks.



>  5. The kvm_run structure
>  
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct 
> timespec64 *ts,
>  
>   return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
>  }
> +
> +
> +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> +{
> + struct timespec64 ts;
> +
> + if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> + *walltime_ns = timespec64_to_ns(&ts);
> + return;
> + }
> +
> + *host_tsc = rdtsc();
> + *walltime_ns = ktime_get_real_ns();
> +}
> +
>  #endif
>  
>  /*
> @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   case KVM_CAP_X86_USER_SPACE_MSR:
>   case KVM_CAP_X86_MSR_FILTER:
>   case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +#ifdef CONFIG_X86_64
> + case KVM_CAP_PRECISE_TSC:
> +#endif
>   r = 1;
>   break;
>   case KVM_CAP_SYNC_REGS:
> @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   case KVM_GET_SUPPORTED_HV_CPUID:
>   r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
>   break;
> +#ifdef CONFIG_X86_64
> + case KVM_GET_TSC_STATE: {
> + struct kvm_tsc_state __user *user_tsc_state = argp;
> + u64 host_tsc;
> +
> + struct kvm_tsc_state tsc_state = {
> + .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> + };
> +
> > + kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> + tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> + if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +  

Re: [PATCH v2 0/3] RFC: Precise TSC migration

2020-12-08 Thread Marcelo Tosatti
On Thu, Dec 03, 2020 at 07:11:15PM +0200, Maxim Levitsky wrote:
> Hi!
> 
> This is the second version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html

Maxim,

Can you please describe the practical problem that is being fixed,
preferably with instructions on how to reproduce it?

> I omitted most of the semi-offtopic points I raised related to TSC
> in the previous RFC where we can continue the discussion.
> 
> I do want to raise another thing that I almost forgot.
> 
> On AMD systems, the Linux kernel will mark the guest tsc as
> unstable unless invtsc is set which is set on recent AMD
> hardware.
> 
> Take a look at 'unsynchronized_tsc()' to verify this.
> 
> This is another thing that IMHO should be fixed at least when
> running under KVM.
> 
> Note that I forgot to mention that
> X86_FEATURE_TSC_RELIABLE also short-circuits this code,
> thus giving another reason to enable it under KVM.
> 
> Changes from V1:
> 
> - added KVM_TSC_STATE_TIMESTAMP_VALID instead of testing ns == 0
> - allow diff < 0, because it is still better that capping it to 0
> - updated tsc_msr_test unit test to cover this feature
> - refactoring
> 
> Patches to enable this feature in qemu are in the process of
> being sent to qemu-devel mailing list.
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (3):
>   KVM: x86: implement KVM_{GET|SET}_TSC_STATE
>   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
>   kvm/selftests: update tsc_msrs_test to cover
> KVM_X86_QUIRK_TSC_HOST_ACCESS
> 
>  Documentation/virt/kvm/api.rst| 65 +
>  arch/x86/include/uapi/asm/kvm.h   |  1 +
>  arch/x86/kvm/x86.c| 92 ++-
>  include/uapi/linux/kvm.h  | 15 +++
>  .../selftests/kvm/x86_64/tsc_msrs_test.c  | 79 ++--
>  5 files changed, 237 insertions(+), 15 deletions(-)
> 
> -- 
> 2.26.2
> 



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Sun, Dec 06, 2020 at 05:19:16PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
> > +   case KVM_SET_TSC_STATE: {
> > +   struct kvm_tsc_state __user *user_tsc_state = argp;
> > +   struct kvm_tsc_state tsc_state;
> > +   u64 host_tsc, wall_nsec;
> > +
> > +   u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > +   r = -EFAULT;
> > > +   if (copy_from_user(&tsc_state, user_tsc_state, 
> > sizeof(tsc_state)))
> > +   goto out;
> > +
> > > +   kvm_get_walltime(&wall_nsec, &host_tsc);
> > +   new_guest_tsc = tsc_state.tsc;
> > +
> > +   if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > +   s64 diff = wall_nsec - tsc_state.nsec;
> > +   if (diff >= 0)
> > +   new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > +   else
> > +   new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > +   }
> > +
> > +   new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, 
> > host_tsc);
> > +   kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> 
> >From a timekeeping POV and the guests expectation of TSC this is
> fundamentally wrong:
> 
>   tscguest = scaled(hosttsc) + offset
> 
> The TSC has to be viewed systemwide and not per CPU. It's systemwide
> used for timekeeping and for that to work it has to be synchronized. 
> 
> Why would this be different on virt? Just because it's virt or what? 
> 
> Migration is a guest wide thing and you're not migrating single vCPUs.
> 
> This hackery just papers over he underlying design fail that KVM looks
> at the TSC per vCPU which is the root cause and that needs to be fixed.

It already does that: the unified TSC offset is kept in kvm->arch.cur_tsc_offset.




Re: [PATCH 0/2] RFC: Precise TSC migration

2020-12-04 Thread Marcelo Tosatti


On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > > Hi Maxim,
> > > > 
> > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > > Hi!
> > > > > 
> > > > > This is the first version of the work to make TSC migration more 
> > > > > accurate,
> > > > > as was defined by Paulo at:
> > > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > > 
> > > > Description from Oliver's patch:
> > > > 
> > > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > > value introduces some challenges with synchronization as the TSCs
> > > > continue to tick throughout the restoration process. As such, KVM has
> > > > some heuristics around TSC writes to infer whether or not the guest or
> > > > host is attempting to synchronize the TSCs."
> > > > 
> > > > Not really. The synchronization logic tries to sync TSCs during
> > > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > > sequentially, say:
> > > > 
> > > > CPU     realtime    TSC val
> > > > vcpu0   0 usec      0
> > > > vcpu1   100 usec    0
> > > > vcpu2   200 usec    0
> > > > ...
> > > > 
> > > > And we'd like to see all vcpus to read the same value at all times.
> > > > 
> > > > Other than that, comment makes sense. The problem with live migration
> > > > is as follows:
> > > > 
> > > > We'd like the TSC value to be written, ideally, just before the first
> > > > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > > in vcpus tsc time).
> > > > 
> > > > Before the first VM-entry is the farthest point in time before guest
> > > > entry that one could do that.
> > > > 
> > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > > 100ms last time i checked (which results in a 100ms time jump forward), 
> > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > > 
> > > > Have we measured any improvement with this patchset?
> > > 
> > > Its not about this window. 
> > > It is about time that passes between the point that we read the 
> > > TSC on source system (and we do it in qemu each time the VM is paused) 
> > > and the moment that we set the same TSC value on the target. 
> > > That time is unbounded.
> > 
> > OK. Well, its the same problem: ideally you'd want to do that just
> > before VCPU-entry.
> > 
> > > Also this patchset should decrease TSC skew that happens
> > > between restoring it on multiple vCPUs as well, since 
> > > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > > as it accounts for time passed on each vCPU.
> > > 
> > > 
> > > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > > I found out that qemu reads the kvmclock value on each pause, 
> > > and then 'restores' on unpause, using
> > > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > > 
> > > This means (and I tested it) that if guest uses kvmclock
> > > for time reference, it will not account for time passed in
> > > the paused state.
> > 
> > Yes, this is necessary because otherwise there might be an overflow
> > in the kernel time accounting code (if the clock delta is too large).
> 
> Could you elaborate on this? Do you mean that guest kernel can crash,
> when the time 'jumps' too far forward in one go?

It can crash (there will be an overflow and time will jump backwards).

> If so this will happen with kernel using TSC as well, 
> since we do let the virtual TSC to 'keep running' while VM is suspended, 
> and the goal of this patchset is to let it 'run' even while
> the VM is migrating.

True. For the overflow one, perhaps the TSC can be stopped (and restored)
similarly to KVMCLOCK.

See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950.

> And if there 

Re: [PATCH 0/2] RFC: Precise TSC migration

2020-12-01 Thread Marcelo Tosatti
On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > Hi Maxim,
> > 
> > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > Hi!
> > > 
> > > This is the first version of the work to make TSC migration more accurate,
> > > as was defined by Paulo at:
> > > https://www.spinics.net/lists/kvm/msg225525.html
> > 
> > Description from Oliver's patch:
> > 
> > "To date, VMMs have typically restored the guest's TSCs by value using
> > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > value introduces some challenges with synchronization as the TSCs
> > continue to tick throughout the restoration process. As such, KVM has
> > some heuristics around TSC writes to infer whether or not the guest or
> > host is attempting to synchronize the TSCs."
> > 
> > Not really. The synchronization logic tries to sync TSCs during
> > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > sequentially, say:
> > 
> > CPU     realtime    TSC val
> > vcpu0   0 usec      0
> > vcpu1   100 usec    0
> > vcpu2   200 usec    0
> > ...
> > 
> > And we'd like to see all vcpus to read the same value at all times.
> > 
> > Other than that, comment makes sense. The problem with live migration
> > is as follows:
> > 
> > We'd like the TSC value to be written, ideally, just before the first
> > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > the vcpus tsc is ticking, which will cause a visible forward jump
> > in vcpus tsc time).
> > 
> > Before the first VM-entry is the farthest point in time before guest
> > entry that one could do that.
> > 
> > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > 100ms last time i checked (which results in a 100ms time jump forward), 
> > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > 
> > Have we measured any improvement with this patchset?
> 
> Its not about this window. 
> It is about time that passes between the point that we read the 
> TSC on source system (and we do it in qemu each time the VM is paused) 
> and the moment that we set the same TSC value on the target. 
> That time is unbounded.

OK. Well, it's the same problem: ideally you'd want to do that just
before VCPU entry.

> Also this patchset should decrease TSC skew that happens
> between restoring it on multiple vCPUs as well, since 
> KVM_SET_TSC_STATE doesn't have to happen at the same time,
> as it accounts for time passed on each vCPU.
> 
> 
> Speaking of kvmclock, somewhat offtopic since this is a different issue,
> I found out that qemu reads the kvmclock value on each pause, 
> and then 'restores' on unpause, using
> KVM_SET_CLOCK (this modifies the global kvmclock offset)
> 
> This means (and I tested it) that if guest uses kvmclock
> for time reference, it will not account for time passed in
> the paused state.

Yes, this is necessary because otherwise there might be an overflow
in the kernel time accounting code (if the clock delta is too large).

> 
> > 
> > Then Paolo mentions (with >), i am replying as usual.
> > 
> > > Ok, after looking more at the code with Maxim I can confidently say that
> > > it's a total mess.  And a lot of the synchronization code is dead
> > > because 1) as far as we could see no guest synchronizes the TSC using
> > > MSR_IA32_TSC; 
> > 
> > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > boots, it does not have to synchronize TSC in software. 
> 
> Do you have an example of such BIOS? I tested OVMF which I compiled
> from git master a few weeks ago, and I also tested this with seabios 
> from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> from BIOS.

Oh, well, QEMU then.

> Or do you refer to the native BIOS on the host doing TSC synchronization?

No, virt.

> > However, upon migration (and initialization), the KVM_SET_TSC's do 
> > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> 
> I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> is going to fix, since it accounts for it.
> 
> 
> > 
> > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > synchronization code in kvm_write_tsc.
> > 
> > Not familiar how guests are using MSR_IA32_TSC_ADJUST (o

Re: [PATCH 0/2] RFC: Precise TSC migration

2020-12-01 Thread Marcelo Tosatti
On Tue, Dec 01, 2020 at 02:48:11PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> >> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> >> but rather use IA32_TSC_ADJUST which currently doesn't participate
> >> in the tsc sync heruistics.
> >
> > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > the BIOS to boot with synced TSCs.
> 
> That's wishful thinking.
> 
> Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
> us to undo the wreckage they create.
> 
> Thanks,
> 
> tglx

I have not seen any multicore Dell/HP systems that require that.

Anyway, for QEMU/KVM the TSC should be synced (unless there is a bug
in the sync logic in the first place).



Re: [PATCH 0/2] RFC: Precise TSC migration

2020-11-30 Thread Marcelo Tosatti
Hi Maxim,

On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> Hi!
> 
> This is the first version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html

Description from Oliver's patch:

"To date, VMMs have typically restored the guest's TSCs by value using
the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
value introduces some challenges with synchronization as the TSCs
continue to tick throughout the restoration process. As such, KVM has
some heuristics around TSC writes to infer whether or not the guest or
host is attempting to synchronize the TSCs."

Not really. The synchronization logic tries to sync TSCs during
BIOS boot (and CPU hotplug), because the TSC values are loaded
sequentially, say:

CPU     realtime        TSC val
vcpu0   0 usec          0
vcpu1   100 usec        0
vcpu2   200 usec        0
...

And we'd like all vcpus to read the same value at all times.

Other than that, the comment makes sense. The problem with live migration
is as follows:

We'd like the TSC value to be written, ideally, just before the first
VM-entry of a vCPU (because from the moment TSC_OFFSET has been written,
the vCPU's TSC is ticking, which will cause a visible forward jump
in the vCPU's TSC time).

Before the first VM-entry is the farthest point in time before guest
entry that one could do that.

The window (or forward jump) between KVM_SET_TSC and VM-entry was about
100ms the last time I checked (which results in a 100ms forward time jump);
see QEMU commit 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
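
For illustration of the accounting idea only (not the actual
KVM_SET_TSC_STATE implementation; the names below are made up), advancing a
saved TSC by the host time elapsed since it was captured looks roughly like:

#include <stdint.h>

/*
 * Illustrative sketch: compensate a saved guest TSC for the host time that
 * elapsed between save and restore, so the guest TSC neither stops nor
 * jumps visibly. A real implementation would use overflow-safe math
 * (e.g. mul_u64_u32_div()) instead of a plain 64-bit multiplication.
 */
static uint64_t compensated_tsc(uint64_t saved_tsc, uint64_t saved_host_ns,
                                uint64_t now_host_ns, uint32_t tsc_khz)
{
        uint64_t elapsed_ns = now_host_ns - saved_host_ns;

        /* tsc_khz is the guest TSC frequency in kHz, i.e. tsc_khz ticks
         * per millisecond, so ticks = elapsed_ns * tsc_khz / 10^6. */
        return saved_tsc + (elapsed_ns * tsc_khz) / 1000000ULL;
}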

Have we measured any improvement with this patchset?

Then Paolo mentions (with >), I am replying as usual.

> Ok, after looking more at the code with Maxim I can confidently say that
> it's a total mess.  And a lot of the synchronization code is dead
> because 1) as far as we could see no guest synchronizes the TSC using
> MSR_IA32_TSC; 

Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
boots, it does not have to synchronize TSC in software. 

However, upon migration (and initialization), the KVM_SET_TSC's do 
not happen at exactly the same time (the MSRs for each vCPU are loaded
in sequence). The synchronization code in kvm_set_tsc() is for those cases.

> and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> synchronization code in kvm_write_tsc.

I am not familiar with how guests (or Linux) use MSR_IA32_TSC_ADJUST...
Let's see:


/*
 * Freshly booted CPUs call into this:
 */
void check_tsc_sync_target(void)
{
        struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
        unsigned int cpu = smp_processor_id();
        cycles_t cur_max_warp, gbl_max_warp;
        int cpus = 2;

        /* Also aborts if there is no TSC. */
        if (unsynchronized_tsc())
                return;

        /*
         * Store, verify and sanitize the TSC adjust register. If
         * successful skip the test.
         *
         * The test is also skipped when the TSC is marked reliable. This
         * is true for SoCs which have no fallback clocksource. On these
         * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
         * register might have been wreckaged by the BIOS..
         */
        if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
                atomic_inc(&skip_test);
                return;
        }

retry:

I'd force that synchronization path to be taken as a test-case.


> I have a few thoughts about the kvm masterclock synchronization,
> which relate to the Paulo's proposal that I implemented.
> 
> The idea of masterclock is that when the host TSC is synchronized
> (or as kernel call it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.

We _have_ to base it on that. See the comment which starts with

"Assuming a stable TSC across physical CPUS, and a stable TSC"

at x86.c.

> 
> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

Actually, without synchronized host TSCs (and the masterclock scheme,
with a single base read from a vCPU), kvmclock in kernel is buggy as
well:

u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
        unsigned version;
        u64 ret;
        u64 last;
        u8 flags;

        do {
                version = pvclock_read_begin(src);
                ret = __pvclock_read_cycles(src, rdtsc_ordered());
                flags = src->flags;
        } while (pvclock_read_retry(src, version));

        if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
                src->flags &= ~PVCLOCK_GUEST_STOPPED;
                pvclock_touch_watchdogs();
        }

        if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
            (flags &

Re: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state

2020-11-27 Thread Marcelo Tosatti
On Thu, Nov 26, 2020 at 07:24:41PM +0100, Rafael J. Wysocki wrote:
> On Thu, Nov 26, 2020 at 6:25 PM Mel Gorman  
> wrote:
> >
> > It was noted that a few workloads that idle rapidly regressed when commit
> > 36fcb4292473 ("cpuidle: use first valid target residency as poll time")
> > was merged. The workloads in question were heavy communicators that idle
> > rapidly and were impacted by the c-state exit latency as the active CPUs
> > were not polling at the time of wakeup. As they were not particularly
> > realistic workloads, it was not considered to be a major problem.
> >
> > Unfortunately, a bug was then reported for a real workload in a production
> > environment that relied on large numbers of threads operating in a worker
> > pool pattern. These threads would idle for periods of time slightly
> > longer than the C1 exit latency and so incurred the c-state exit latency.
> > The application is sensitive to wakeup latency and appears to indirectly
> > rely on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state:
> > Add time limit to poll_idle()") to poll for long enough to avoid the exit
> > latency cost.
> 
> Well, this means that it depends on the governor to mispredict short
> idle durations (so it selects "poll" over "C1" when it should select
> "C1" often enough) and on the lack of a polling limit (or a large
> enough one).
> 
> While the latter can be kind of addressed by increasing the polling
> limit, the misprediction in the governor really isn't guaranteed to
> happen and it really is necessary to have a PM QoS request in place to
> ensure a suitable latency.
> 
> > The current behaviour favours power consumption over wakeup latency
> > and it is reasonable behaviour but it should be tunable.
> 
> Only if there is no way to cover all of the relevant use cases in a
> generally acceptable way without adding more module params etc.
> 
> In this particular case, it should be possible to determine a polling
> limit acceptable to everyone.
> 
> BTW, I admit that using the exit latency of the lowest enabled C-state
> was kind of arbitrary and it was based on the assumption that it would
> make more sense to try to enter C1 instead of polling for that much
> time, but C1 is an exception, because it is often artificially made
> particularly attractive to the governors (by reducing its target
> residency as much as possible).  Also making the polling limit that
> short distorts the governor statistics somewhat.
> 
> So the polling limit equal to the target residency of C1 really may be
> overly aggressive and something tick-based may work better in general
> (e.g. 1/8 or 1/16 of the tick period).
> 
> In principle, a multiple of C1 target residency could be used too.
> 
> > In theory applications could use /dev/cpu_dma_latency but not all 
> > applications
> > are aware of cpu_dma_latency. Similarly, a tool could be installed
> > that opens cpu_dma_latency for the whole system but such a tool is not
> > always available, is not always known to the sysadmin or the tool can have
> > unexpected side-effects if it tunes more than cpu_dma_latency. In practice,
> > it is more common for sysadmins to try idle=poll (which is x86 specific)
> 
> And really should be avoided if one cares about turbo or wants to
> avoid thermal issues.
> 
> > or try disabling c-states and hope for the best.
> >
> > This patch makes it straight-forward to configure how long a CPU should
> > poll before entering a c-state.
> 
> Well, IMV this is not straightforward at all.
> 
> It requires the admin to know how cpuidle works and why this
> particular polling limit is likely to be suitable for the given
> workload.  And whether or not the default polling limit should be
> changed at all.

KVM polling (virt/kvm/kvm_main.c grow_halt_poll_ns/shrink_halt_poll_ns)
tries to adjust the polling window based on poll success/failure. 

The cpuidle haltpoll governor (for KVM guests) uses the same adjustment
logic.

Perhaps a similar (or improved) scheme can be adapted to baremetal.

https://www.kernel.org/doc/Documentation/virtual/kvm/halt-polling.txt
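
The core of that adjustment logic is roughly the following sketch (constants
and names are illustrative, not the actual kvm_main.c or haltpoll code):

/* Illustrative sketch of adaptive poll-window tuning, loosely modeled on
 * KVM's grow_halt_poll_ns()/shrink_halt_poll_ns(); not the kernel code. */
#define POLL_NS_GROW     2       /* multiply window on successful poll */
#define POLL_NS_SHRINK   2       /* divide window when polling failed */
#define POLL_NS_START    10000   /* initial window: 10us */
#define POLL_NS_MAX      200000  /* upper bound: 200us */

static unsigned int poll_ns;     /* current per-CPU polling window */

static void adjust_poll_window(int poll_succeeded)
{
        if (poll_succeeded) {
                /* Work arrived while polling: polling paid off, grow. */
                poll_ns = poll_ns ? poll_ns * POLL_NS_GROW : POLL_NS_START;
                if (poll_ns > POLL_NS_MAX)
                        poll_ns = POLL_NS_MAX;
        } else {
                /* Polled the full window for nothing: shrink (or reset). */
                poll_ns /= POLL_NS_SHRINK;
        }
}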
> 
> Honestly, nobody knows that in advance (with all due respect) and this
> would cause people to try various settings at random and stick to the
> one that they feel works best for them without much understanding.
> 
> > By default, there is no behaviour change.
> > At build time a decision can be made to favour performance over power
> > by default even if that potentially impacts turbo boosting for workloads
> > that are sensitive to wakeup latency. In the event the kernel default is
> > not suitable, the kernel command line can be used as a substitute for
> > implementing cpu_dma_latency support in an application or requiring an
> > additional tool to be installed.
> >
> > Note that it is not expected that tuning for longer polling times will be a
> > universal win. For example, extra polling might prevent a turbo state being
> > used or prevent hyperthread resources being 

Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-27 Thread Marcelo Tosatti
On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/26/20 5:50 PM, Thomas Gleixner wrote:
> > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
> >> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
> >>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>  Are there drivers which use more than one interrupt per queue? I know
>  drivers have multiple management interrupts.. and I guess some drivers
>  do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>  have multiple queues for one interrupt .. I'm not sure how a single
>  queue with multiple interrupts would work though.
> >>> For block there is always one interrupt per queue. Some Network drivers
> >>> seem to have seperate RX and TX interrupts per queue.
> >> That's true when thinking of Tx and Rx as a single queue. Another way to
> >> think about it is "one rx queue" and "one tx queue" each with their own
> >> interrupt...
> >>
> >> Even if there are devices which force there to be exactly queue pairs,
> >> you could still think of them as separate entities?
> > Interesting thought.
> >
> > But as Jakub explained networking queues are fundamentally different
> > from block queues on the RX side. For block the request issued on queue
> > X will raise the complete interrupt on queue X.
> >
> > For networking the TX side will raise the TX interrupt on the queue on
> > which the packet was queued obviously or should I say hopefully. :)
> 
> This is my impression as well.
> 
> > But incoming packets will be directed to some receive queue based on a
> > hash or whatever crystallball logic the firmware decided to implement.
> >
> > Which makes this not really suitable for the managed interrupt and
> > spreading approach which is used by block-mq. Hrm...
> >
> > But I still think that for curing that isolation stuff we want at least
> > some information from the driver. Alternative solution would be to grant
> > the allocation of interrupts and queues and have some sysfs knob to shut
> > down queues at runtime. If that shutdown results in releasing the queue
> > interrupt (via free_irq()) then the vector exhaustion problem goes away.
> 
> I think this is close to what I and Marcelo were discussing earlier today
> privately.
> 
> I don't think there is currently a way to control the enablement/disablement 
> of
> interrupts from the userspace.

As long as the interrupt obeys the "trigger when a request has been
performed by the local CPU" rule (#1) (for MSI-type interrupts, where the
driver allocates one I/O interrupt per CPU), I don't see a need for the
interface.

For other types of interrupts, interrupt controller should be programmed
to not include the isolated CPU on its "destination CPU list".

About the block VS network discussion, what we are trying to do at skb
level (Paolo Abeni CC'ed, author of the suggestion) is to use RPS to
avoid skbs from being queued to a CPU (on RX), and to queue skbs
on housekeeping CPUs for processing (TX).

However, if per-CPU interrupts are not disabled, then the (for example)
network device is free to include the CPU in its list of destinations,
which would require one to, say, configure RPS (or whatever mechanism
is distributing interrupts).

Hmm, it would feel safer (rather than trusting the #1 rule to be valid
in all cases) to ask the driver to disable the interrupt (after shutting
down the queue) for that particular CPU.

BTW, Thomas, software is free to configure a particular MSI-X interrupt
to point to any CPU:

10.11 MESSAGE SIGNALLED INTERRUPTS
The PCI Local Bus Specification, Rev 2.2 (www.pcisig.com) introduces the 
concept of message signalled interrupts.
As the specification indicates:
“Message signalled interrupts (MSI) is an optional feature that enables PCI 
devices to request
service by writing a system-specified message to a system-specified address 
(PCI DWORD memory
write transaction). The transaction address specifies the message destination 
while the transaction
data specifies the message. System software is expected to initialize the 
message destination and
message during device configuration, allocating one or more non-shared messages 
to each MSI
capable function.”

Fields in the Message Address Register are as follows:
1. Bits 31-20 — These bits contain a fixed value for interrupt messages 
(0FEEH). This value locates interrupts at
the 1-MByte area with a base address of 4G – 18M. All accesses to this region 
are directed as interrupt
messages. Care must to be taken to ensure that no other device claims the 
region as I/O space.
2. Destination ID — This field contains an 8-bit destination ID. It identifies 
the message’s target processor(s).
The destination ID corresponds to bits 63:56 of the I/O APIC Redirection Table 
Entry if the IOAPIC is used to
dispatch the interrupt to the processor(s).

---

So taking the example where computation happens while isolated and later
stored via block interface, aren't we restricting the usage scenarios
by enforcing 

Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-27 Thread Marcelo Tosatti
On Mon, Oct 26, 2020 at 08:00:39PM +0100, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
> > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> >> So without information from the driver which tells what the best number
> >> of interrupts is with a reduced number of CPUs, this cutoff will cause
> >> more problems than it solves. Regressions guaranteed.
> >
> > One might want to move from one interrupt per isolated app core
> > to zero, or vice versa. It seems that "best number of interrupts 
> > is with reduced number of CPUs" information, is therefore in userspace, 
> > not in driver...
> 
> How does userspace know about the driver internals? Number of management
> interrupts, optimal number of interrupts per queue?
> 
> >> Managed interrupts base their interrupt allocation and spreading on
> >> information which is handed in by the individual driver and not on crude
> >> assumptions. They are not imposing restrictions on the use case.
> >> 
> >> It's perfectly fine for isolated work to save a data set to disk after
> >> computation has finished and that just works with the per-cpu I/O queue
> >> which is otherwise completely silent. 
> >
> > Userspace could only change the mask of interrupts which are not 
> > triggered by requests from the local CPU (admin, error, mgmt, etc),
> > to avoid the vector exhaustion problem.
> >
> > However, there is no explicit way for userspace to know that, as far as
> > i know.
> >
> >  130:  34845      0      0      0      0      0      0      0  IR-PCI-MSI 33554433-edge  nvme0q1
> >  131:      0  27062      0      0      0      0      0      0  IR-PCI-MSI 33554434-edge  nvme0q2
> >  132:      0      0  24393      0      0      0      0      0  IR-PCI-MSI 33554435-edge  nvme0q3
> >  133:      0      0      0  24313      0      0      0      0  IR-PCI-MSI 33554436-edge  nvme0q4
> >  134:      0      0      0      0  20608      0      0      0  IR-PCI-MSI 33554437-edge  nvme0q5
> >  135:      0      0      0      0      0  22163      0      0  IR-PCI-MSI 33554438-edge  nvme0q6
> >  136:      0      0      0      0      0      0  23020      0  IR-PCI-MSI 33554439-edge  nvme0q7
> >  137:      0      0      0      0      0      0      0  24285  IR-PCI-MSI 33554440-edge  nvme0q8
> >
> > Can that be retrieved from PCI-MSI information, or drivers
> > have to inform this?
> 
> The driver should use a different name for the admin queues.

Works for me.

It sounds more like a heuristic which can break, though, so documenting this
as an "interface" seems appropriate.



Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-26 Thread Marcelo Tosatti
On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> > On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
> >> So shouldn't we then fix the drivers / interface first, to get rid of
> >> this inconsistency?
> >>
> > Considering we agree that excess vector is a problem that needs to be
> > solved across all the drivers and that you are comfortable with the other
> > three patches in the set. If I may suggest the following:
> >
> > - We can pick those three patches for now, as that will atleast fix a
> >   driver that is currently impacting RT workloads. Is that a fair
> >   expectation?
> 
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
> what the right number of interrupts for this situation is.
> 
> Many of these drivers need more than queue interrupts, admin, error
> interrupt and some operate best with seperate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
> 
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.
> 
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the points that it breaks a carefully crafted setup.
> 
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!

Good point.

> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.
> 
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.

One might want to move from one interrupt per isolated app core
to zero, or vice versa. It seems that the "best number of interrupts
with a reduced number of CPUs" information is therefore in userspace,
not in the driver...

No?

> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.
> 
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. 

Userspace could only change the mask of interrupts which are not 
triggered by requests from the local CPU (admin, error, mgmt, etc),
to avoid the vector exhaustion problem.

However, there is no explicit way for userspace to know that, as far as
I know.

 130:  34845      0      0      0      0      0      0      0  IR-PCI-MSI 33554433-edge  nvme0q1
 131:      0  27062      0      0      0      0      0      0  IR-PCI-MSI 33554434-edge  nvme0q2
 132:      0      0  24393      0      0      0      0      0  IR-PCI-MSI 33554435-edge  nvme0q3
 133:      0      0      0  24313      0      0      0      0  IR-PCI-MSI 33554436-edge  nvme0q4
 134:      0      0      0      0  20608      0      0      0  IR-PCI-MSI 33554437-edge  nvme0q5
 135:      0      0      0      0      0  22163      0      0  IR-PCI-MSI 33554438-edge  nvme0q6
 136:      0      0      0      0      0      0  23020      0  IR-PCI-MSI 33554439-edge  nvme0q7
 137:      0      0      0      0      0      0      0  24285  IR-PCI-MSI 33554440-edge  nvme0q8


Can that be retrieved from PCI-MSI information, or do drivers
have to provide this information?

> All isolated workers can do the
> same in parallel without trampling on each other toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
> 
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly part
> of it which can be reused.
> 
> This needs a lot more thought than just these crude hacks.
> 
> Especially under the aspect that there 

Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-22 Thread Marcelo Tosatti
On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.

Exactly. It seems the proper way to handle this is to disable
individual vectors rather than moving them. And that is needed for
dynamic isolate/unisolate anyway...

> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

Aha... But it would be necessary to do that from userspace (for runtime
isolate/unisolate).



Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-20 Thread Marcelo Tosatti
On Thu, Oct 15, 2020 at 01:40:53AM +0200, Frederic Weisbecker wrote:
> On Wed, Oct 14, 2020 at 10:33:21AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2020 at 02:13:28PM -0300, Marcelo Tosatti wrote:
> > 
> > > > Yes but if the task isn't running, run_posix_cpu_timers() doesn't have
> > > > anything to elapse. So indeed we can spare the IPI if the task is not
> > > > running. Provided ordering makes sure that the task sees the new 
> > > > dependency
> > > > when it schedules in of course.
> > > 
> > > True.
> > > 
> > >  * p->on_cpu <- { 0, 1 }:
> > >  *
> > >  *   is set by prepare_task() and cleared by finish_task() such that it 
> > > will be
> > >  *   set before p is scheduled-in and cleared after p is scheduled-out, 
> > > both
> > >  *   under rq->lock. Non-zero indicates the task is running on its CPU.
> > > 
> > > 
> > > CPU-0 (tick_set_dep)                   CPU-1 (task switch)
> > > 
> > > STORE p->tick_dep_mask
> > > smp_mb() (atomic_fetch_or())
> > > LOAD p->on_cpu
> > > 
> > >                                        context_switch(prev, next)
> > >                                          STORE next->on_cpu = 1
> > >                                          ... [*]
> > > 
> > >                                          LOAD current->tick_dep_mask
> > > 
> > 
> > That load is in tick_nohz_task_switch() right? (which BTW is placed
> > completely wrong) You could easily do something like the below I
> > suppose.
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 81632cd5e3b7..2a5fafe66bb0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -410,6 +410,14 @@ void __tick_nohz_task_switch(void)
> > ts = this_cpu_ptr(_cpu_sched);
> >  
> > if (ts->tick_stopped) {
> > +   /*
> > +* tick_set_dep()   (this)
> > +*
> > +* STORE p->tick_dep_mask   STORE p->on_cpu
> > +* smp_mb() smp_mb()
> > +* LOAD p->on_cpu   LOAD p->tick_dep_mask
> > +*/
> > +   smp_mb();
> > if (atomic_read(>tick_dep_mask) ||
> > atomic_read(>signal->tick_dep_mask))
> > tick_nohz_full_kick();
> 
> It would then need to be unconditional (whatever value of ts->tick_stopped).
> Assuming the tick isn't stopped, we may well have an interrupt firing right
> after schedule() which doesn't see the new value of tick_dep_map.
> 
> Alternatively, we could rely on p->on_rq which is set to TASK_ON_RQ_QUEUED
> at wake up time, prior to the schedule() full barrier. Of course that doesn't
> mean that the task is actually the one running on the CPU but it's a good 
> sign,
> considering that we are running in nohz_full mode and it's usually optimized
> for single task mode.

Unfortunately that would require exporting p->on_rq, which is internal to
the scheduler, for lockless access.

(I can surely do that if you prefer!)

> 
> Also setting a remote task's tick dependency is only used by posix cpu timer
> in case the user has the bad taste to enqueue on a task running in nohz_full
> mode. It shouldn't deserve an unconditional full barrier in the schedule path.
> 
> If the target is current, as is used by RCU, I guess we can keep a special
> treatment.

To answer PeterZ's original question:

"So we need to kick the CPU unconditionally, or only when the task is
actually running? AFAICT we only care about current->tick_dep_mask."

If there is a task sharing signals, executing on a remote CPU, yes that remote 
CPU 
should be awakened.

We could skip the IPI if the remote process is not running; however, for
the purposes of low-latency isolated processes, this optimization is
not necessary.

So the last posted patchset is good enough for isolated low latency processes.

Do you guys want me to do something or can you take the series as it is?

> > re tick_nohz_task_switch() being placed wrong, it should probably be
> > placed before finish_lock_switch(). Something like so.
> > 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index cf044580683c..5c92c959824f 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4084,6 +4084,7 @@ static struct rq *finish_task_switch(struct 
> > task_struct *prev)
> > vtime_task_switch(prev);
> > perf_event_task_sched_in(prev, current);
> > finish_task(prev);
> > +   tick_nohz_task_switch();
> > finish_lock_switch(rq);
> > finish_arch_post_lock_switch();
> > kcov_finish_switch(current);
> > @@ -4121,7 +4122,6 @@ static struct rq *finish_task_switch(struct 
> > task_struct *prev)
> > put_task_struct_rcu_user(prev);
> > }
> >  
> > -   tick_nohz_task_switch();
> 
> IIRC, we wanted to keep it outside rq lock because it shouldn't it...



Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-19 Thread Marcelo Tosatti
On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> > >> +hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > >> +
> > >> +/*
> > >> + * If we have isolated CPUs for use by real-time tasks, to keep 
> > >> the
> > >> + * latency overhead to a minimum, device-specific IRQ vectors 
> > >> are moved
> > >> + * to the housekeeping CPUs from the userspace by changing their
> > >> + * affinity mask. Limit the vector usage to keep housekeeping 
> > >> CPUs from
> > >> + * running out of IRQ vectors.
> > >> + */
> > >> +if (hk_cpus < num_online_cpus()) {
> > >> +if (hk_cpus < min_vecs)
> > >> +max_vecs = min_vecs;
> > >> +else if (hk_cpus < max_vecs)
> > >> +max_vecs = hk_cpus;
> > > is that:
> > >
> > >   max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> > 
> > Yes, I think this will do.
> > 
> > >
> > > Also, do we really need to have that conditional on hk_cpus <
> > > num_online_cpus()? That is, why can't we do this unconditionally?
> > 
> > FWIU most of the drivers using this API already restricts the number of
> > vectors based on the num_online_cpus, if we do it unconditionally we can
> > unnecessary duplicate the restriction for cases where we don't have any
> > isolated CPUs.
> 
> unnecessary isn't really a concern here, this is a slow path. What's
> important is code clarity.
> 
> > Also, different driver seems to take different factors into consideration
> > along with num_online_cpus while finding the max_vecs to request, for
> > example in the case of mlx5:
> > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
> >    MLX5_EQ_VEC_COMP_BASE
> > 
> > Having hk_cpus < num_online_cpus() helps us ensure that we are only
> > changing the behavior when we have isolated CPUs.
> > 
> > Does that make sense?
> 
> That seems to want to allocate N interrupts per cpu (plus some random
> static amount, which seems weird, but whatever). This patch breaks that.

On purpose. For the isolated CPUs we don't want network device 
interrupts (in this context).

> So I think it is important to figure out what that driver really wants
> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> only reduce the number of CPUs, the proposed interface is wrong.

It wants N interrupts per non-isolated (AKA housekeeping) CPU.
Zero interrupts for isolated CPUs.

> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Yes, but this patch is a fix for a customer bug in the old, static on-boot
CPU isolation configuration.

---

Discussing the dynamic configuration (not this patch!) case:

One would need to enable/disable interrupts for a particular device
on a per-CPU basis. Such an interface does not exist yet.

Perhaps that is what you are looking for when writing "proposed interface
is wrong", Peter?





Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-13 Thread Marcelo Tosatti
On Thu, Oct 08, 2020 at 09:54:44PM +0200, Frederic Weisbecker wrote:
> On Thu, Oct 08, 2020 at 02:54:09PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 08, 2020 at 02:22:56PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 07, 2020 at 03:01:52PM -0300, Marcelo Tosatti wrote:
> > > > When adding a tick dependency to a task, its necessary to
> > > > wakeup the CPU where the task resides to reevaluate tick
> > > > dependencies on that CPU.
> > > > 
> > > > However the current code wakes up all nohz_full CPUs, which 
> > > > is unnecessary.
> > > > 
> > > > Switch to waking up a single CPU, by using ordering of writes
> > > > to task->cpu and task->tick_dep_mask.
> > > > 
> > > > From: Frederic Weisbecker 
> > > > Suggested-by: Peter Zijlstra 
> > > > Signed-off-by: Frederic Weisbecker 
> > > > Signed-off-by: Marcelo Tosatti 
> > > > 
> > > > Index: linux-2.6/kernel/time/tick-sched.c
> > > > ===
> > > > --- linux-2.6.orig/kernel/time/tick-sched.c
> > > > +++ linux-2.6/kernel/time/tick-sched.c
> > > > @@ -274,6 +274,31 @@ void tick_nohz_full_kick_cpu(int cpu)
> > > > irq_work_queue_on(_cpu(nohz_full_kick_work, cpu), cpu);
> > > >  }
> > > >  
> > > > +static void tick_nohz_kick_task(struct task_struct *tsk)
> > > > +{
> > > > +   int cpu = task_cpu(tsk);
> > > > +
> > > > +   /*
> > > > +* If the task concurrently migrates to another cpu,
> > > > +* we guarantee it sees the new tick dependency upon
> > > > +* schedule.
> > > > +*
> > > > +*
> > > > +* set_task_cpu(p, cpu);
> > > > +*   STORE p->cpu = @cpu
> > > > +* __schedule() (switch to task 'p')
> > > > +*   LOCK rq->lock
> > > > +*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
> > > > +*   tick_nohz_task_switch()smp_mb() 
> > > > (atomic_fetch_or())
> > > > +*  LOAD p->tick_dep_mask   LOAD p->cpu
> > > > +*/
> > > > +
> > > > +   preempt_disable();
> > > > +   if (cpu_online(cpu))
> > > > +   tick_nohz_full_kick_cpu(cpu);
> > > > +   preempt_enable();
> > > > +}
> > > 
> > > So we need to kick the CPU unconditionally, or only when the task is
> > > actually running? AFAICT we only care about current->tick_dep_mask.
> > 
> > tick is necessary to execute run_posix_cpu_timers, from tick interrupt, 
> > even if task is not running.
> 
> Yes but if the task isn't running, run_posix_cpu_timers() doesn't have
> anything to elapse. So indeed we can spare the IPI if the task is not
> running. Provided ordering makes sure that the task sees the new dependency
> when it schedules in of course.

True.

 * p->on_cpu <- { 0, 1 }:
 *
 *   is set by prepare_task() and cleared by finish_task() such that it will be
 *   set before p is scheduled-in and cleared after p is scheduled-out, both
 *   under rq->lock. Non-zero indicates the task is running on its CPU.


CPU-0 (tick_set_dep)                    CPU-1 (task switch)

STORE p->tick_dep_mask
smp_mb() (atomic_fetch_or())
LOAD p->on_cpu

                                        context_switch(prev, next)
                                          STORE next->on_cpu = 1
                                          ... [*]

                                          LOAD current->tick_dep_mask


I don't see any explicit memory barrier in the [*] section?



[patch 2/2] nohz: change signal tick dependency to wakeup CPUs of member tasks

2020-10-08 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wake up
the target CPUs of the member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -396,9 +396,17 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to 
elapse
  * per process timers.
  */
-void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
+void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   tick_nohz_dep_set_all(>tick_dep_mask, bit);
+   int prev;
+   struct task_struct *p;
+
+   prev = atomic_fetch_or(BIT(bit), >signal->tick_dep_mask);
+   if (!prev) {
+   lockdep_assert_held(>sighand->siglock);
+   for_each_thread(tsk, p)
+   tick_nohz_kick_task(p);
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
Index: linux-2.6/include/linux/tick.h
===
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -207,7 +207,7 @@ extern void tick_nohz_dep_set_task(struc
   enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
 enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
   enum tick_dep_bits bit);
@@ -252,11 +252,11 @@ static inline void tick_dep_clear_task(s
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
 }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit)
 {
if (tick_nohz_full_enabled())
-   tick_nohz_dep_set_signal(signal, bit);
+   tick_nohz_dep_set_signal(tsk, bit);
 }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit)
@@ -284,7 +284,7 @@ static inline void tick_dep_set_task(str
 enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_task(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit) { }
Index: linux-2.6/kernel/time/posix-cpu-timers.c
===
--- linux-2.6.orig/kernel/time/posix-cpu-timers.c
+++ linux-2.6/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *t
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
-   tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(p, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 /*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_s
if (*newval < *nextevt)
*nextevt = *newval;
 
-   tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(tsk, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,




[patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-08 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -274,6 +274,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
+*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
+*  LOAD p->tick_dep_mask   LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -356,19 +381,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), >tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), >tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 0/2] nohz_full: only wakeup target CPUs when notifying new tick dependency (v3)

2020-10-08 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (eg DPDK). This patch changes the notification
to only IPI the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.

---

v3: replace superfluous rcu_read_lock with lockdep_assert (PeterZ)




Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-08 Thread Marcelo Tosatti
On Thu, Oct 08, 2020 at 02:22:56PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 03:01:52PM -0300, Marcelo Tosatti wrote:
> > When adding a tick dependency to a task, its necessary to
> > wakeup the CPU where the task resides to reevaluate tick
> > dependencies on that CPU.
> > 
> > However the current code wakes up all nohz_full CPUs, which 
> > is unnecessary.
> > 
> > Switch to waking up a single CPU, by using ordering of writes
> > to task->cpu and task->tick_dep_mask.
> > 
> > From: Frederic Weisbecker 
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Frederic Weisbecker 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -274,6 +274,31 @@ void tick_nohz_full_kick_cpu(int cpu)
> > irq_work_queue_on(_cpu(nohz_full_kick_work, cpu), cpu);
> >  }
> >  
> > +static void tick_nohz_kick_task(struct task_struct *tsk)
> > +{
> > +   int cpu = task_cpu(tsk);
> > +
> > +   /*
> > +* If the task concurrently migrates to another cpu,
> > +* we guarantee it sees the new tick dependency upon
> > +* schedule.
> > +*
> > +*
> > +* set_task_cpu(p, cpu);
> > +*   STORE p->cpu = @cpu
> > +* __schedule() (switch to task 'p')
> > +*   LOCK rq->lock
> > +*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
> > +*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
> > +*  LOAD p->tick_dep_mask   LOAD p->cpu
> > +*/
> > +
> > +   preempt_disable();
> > +   if (cpu_online(cpu))
> > +   tick_nohz_full_kick_cpu(cpu);
> > +   preempt_enable();
> > +}
> 
> So we need to kick the CPU unconditionally, or only when the task is
> actually running? AFAICT we only care about current->tick_dep_mask.

The tick is necessary to execute run_posix_cpu_timers(), from the tick
interrupt, even if the task is not running.



Re: [patch 2/2] nohz: change signal tick dependency to wakeup CPUs of member tasks

2020-10-08 Thread Marcelo Tosatti
On Thu, Oct 08, 2020 at 02:35:44PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 03:01:53PM -0300, Marcelo Tosatti wrote:
> > Rather than waking up all nohz_full CPUs on the system, only wakeup 
> > the target CPUs of member threads of the signal.
> > 
> > Reduces interruptions to nohz_full CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -398,7 +398,15 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
> >   */
> >  void tick_nohz_dep_set_signal(struct signal_struct *sig, enum 
> > tick_dep_bits bit)
> >  {
> > -   tick_nohz_dep_set_all(>tick_dep_mask, bit);
> > +   int prev;
> > +
> > +   prev = atomic_fetch_or(BIT(bit), >tick_dep_mask);
> > +   if (!prev) {
> > +   rcu_read_lock();
> > +   for_each_thread(sig, t)
> > +   tick_nohz_kick_task(t);
> > +   rcu_read_unlock();
> > +   }
> >  }
> 
> AFAICT, and this makes perfect sense, this function is only ever used
> while holding sighand->siglock, which makes the RCU read lock
> superfluous.
> 
> Would it make sense to change the signal_struct argument to task_struct,
> such that we can write:
> 
>   lockdep_assert_held(>sighand->siglock);
>   for_each_thread(p->signal, t)
>   tick_nohz_kick_task(t);
> 
> ?

Makes sense, resending -v3.



Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-08 Thread Marcelo Tosatti
On Thu, Oct 08, 2020 at 10:59:40AM -0400, Peter Xu wrote:
> On Wed, Oct 07, 2020 at 03:01:52PM -0300, Marcelo Tosatti wrote:
> > +static void tick_nohz_kick_task(struct task_struct *tsk)
> > +{
> > +   int cpu = task_cpu(tsk);
> > +
> > +   /*
> > +* If the task concurrently migrates to another cpu,
> > +* we guarantee it sees the new tick dependency upon
> > +* schedule.
> > +*
> > +*
> > +* set_task_cpu(p, cpu);
> > +*   STORE p->cpu = @cpu
> > +* __schedule() (switch to task 'p')
> > +*   LOCK rq->lock
> > +*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
> > +*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
> > +*  LOAD p->tick_dep_mask   LOAD p->cpu
> > +*/
> > +
> > +   preempt_disable();
> 
> Pure question: is preempt_disable() required here?  Same question to
> tick_nohz_full_kick_all().

Hi Peter,

I don't see why: irq_work_queue_on() disables preemption if necessary.

> 
> > +   if (cpu_online(cpu))
> > +   tick_nohz_full_kick_cpu(cpu);
> > +   preempt_enable();
> > +}
> 
> -- 
> Peter Xu



[patch 2/2] nohz: change signal tick dependency to wakeup CPUs of member tasks

2020-10-07 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wake up
the target CPUs of the member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -398,7 +398,15 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  */
 void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
 {
-   tick_nohz_dep_set_all(>tick_dep_mask, bit);
+   int prev;
+
+   prev = atomic_fetch_or(BIT(bit), >tick_dep_mask);
+   if (!prev) {
+   rcu_read_lock();
+   for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   rcu_read_unlock();
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)




[patch 1/2] nohz: only wakeup a single target cpu when kicking a task

2020-10-07 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -274,6 +274,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
+*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
+*  LOAD p->tick_dep_mask   LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -356,19 +381,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), >tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), >tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 0/2] nohz_full: only wakeup target CPUs when notifying new tick dependency (v2)

2020-10-07 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (eg DPDK). This patch changes the notification
to only IPI the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.





Re: [RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-11 Thread Marcelo Tosatti
On Wed, Sep 09, 2020 at 11:08:17AM -0400, Nitesh Narayan Lal wrote:
> In a realtime environment, it is essential to isolate unwanted IRQs from
> isolated CPUs to prevent latency overheads. Creating MSIX vectors only
> based on the online CPUs could lead to a potential issue on an RT setup
> that has several isolated CPUs but a very few housekeeping CPUs. This is
> because in these kinds of setups an attempt to move the IRQs to the
> limited housekeeping CPUs from isolated CPUs might fail due to the per
> CPU vector limit. This could eventually result in latency spikes because
> of the IRQ threads that we fail to move from isolated CPUs.
> 
> This patch prevents i40e to add vectors only based on available
> housekeeping CPUs by using num_housekeeping_cpus().
> 
> Signed-off-by: Nitesh Narayan Lal 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 2e433fdbf2c3..3b4cd4b3de85 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /* Local includes */
> @@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
>* will use any remaining vectors to reach as close as we can to the
>* number of online CPUs.
>*/
> - cpus = num_online_cpus();
> + cpus = num_housekeeping_cpus();
>   pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
>   vectors_left -= pf->num_lan_msix;
>  
> -- 
> 2.27.0

For patches 1 and 2:

Reviewed-by: Marcelo Tosatti 



Re: [RFC][Patch v1 3/3] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-10 Thread Marcelo Tosatti
On Wed, Sep 09, 2020 at 11:08:18AM -0400, Nitesh Narayan Lal wrote:
> This patch limits the pci_alloc_irq_vectors max vectors that is passed on
> by the caller based on the available housekeeping CPUs by only using the
> minimum of the two.
> 
> A minimum of the max_vecs passed and available housekeeping CPUs is
> derived to ensure that we don't create excess vectors which can be
> problematic specifically in an RT environment. This is because for an RT
> environment unwanted IRQs are moved to the housekeeping CPUs from
> isolated CPUs to keep the latency overhead to a minimum. If the number of
> housekeeping CPUs are significantly lower than that of the isolated CPUs
> we can run into failures while moving these IRQs to housekeeping due to
> per CPU vector limit.
> 
> Signed-off-by: Nitesh Narayan Lal 
> ---
>  include/linux/pci.h | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 835530605c0d..750ba927d963 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -38,6 +38,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -1797,6 +1798,21 @@ static inline int
>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> unsigned int max_vecs, unsigned int flags)
>  {
> + unsigned int num_housekeeping = num_housekeeping_cpus();
> + unsigned int num_online = num_online_cpus();
> +
> + /*
> +  * Try to be conservative and at max only ask for the same number of
> +  * vectors as there are housekeeping CPUs. However, skip any
> +  * modification to the of max vectors in two conditions:
> +  * 1. If the min_vecs requested are higher than that of the
> +  *housekeeping CPUs as we don't want to prevent the initialization
> +  *of a device.
> +  * 2. If there are no isolated CPUs as in this case the driver should
> +  *already have taken online CPUs into consideration.
> +  */
> + if (min_vecs < num_housekeeping && num_housekeeping != num_online)
> + max_vecs = min_t(int, max_vecs, num_housekeeping);
>   return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
> NULL);
>  }

If min_vecs > num_housekeeping, for example:

/* PCI MSI/MSIx support */
#define XGBE_MSI_BASE_COUNT 4
#define XGBE_MSI_MIN_COUNT  (XGBE_MSI_BASE_COUNT + 1)

Then the protection fails.

How about reducing max_vecs down to min_vecs, if min_vecs >
num_housekeeping?
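
Something along these lines (an illustrative sketch of that suggestion
applied to the wrapper quoted above, not a proposed patch):

static inline int
pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
                      unsigned int max_vecs, unsigned int flags)
{
        unsigned int num_housekeeping = num_housekeeping_cpus();
        unsigned int num_online = num_online_cpus();

        /* Only adjust anything when there are isolated CPUs. */
        if (num_housekeeping != num_online) {
                if (min_vecs > num_housekeeping)
                        /* The device needs more than we would like:
                         * give it the minimum it can work with and
                         * nothing extra. */
                        max_vecs = min_vecs;
                else
                        max_vecs = min_t(unsigned int, max_vecs,
                                         num_housekeeping);
        }

        return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
                                              NULL);
}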




Re: [patch 2/2] nohz: try to avoid IPI when setting tick dependency for task

2020-09-10 Thread Marcelo Tosatti
On Thu, Sep 03, 2020 at 05:01:53PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2020 at 03:41:49PM -0300, Marcelo Tosatti wrote:
> > When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
> > performed (to re-read the dependencies and possibly not re-enter
> > nohz_full on a given CPU).
> > 
> > A common case is for applications that run on nohz_full= CPUs
> > to not use POSIX timers (eg DPDK).
> > 
> > This patch optimizes tick_nohz_dep_set_task to avoid kicking
> > all nohz_full= CPUs in case the task allowed mask does not
> > intersect with nohz_full= CPU mask,
> > when going through tick_nohz_dep_set_task.
> > 
> > This reduces interruptions to nohz_full= CPUs.
> > 
> > ---
> >  kernel/time/tick-sched.c |9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -383,11 +383,16 @@ void tick_nohz_dep_set_task(struct task_
> > tick_nohz_full_kick();
> > preempt_enable();
> > } else {
> > +   unsigned long flags;
> > +
> > /*
> >  * Some future tick_nohz_full_kick_task()
> > -* should optimize this.
> > +* should further optimize this.
> >  */
> > -   tick_nohz_full_kick_all();
> > +   raw_spin_lock_irqsave(>pi_lock, flags);
> > +   if (cpumask_intersects(>cpus_mask, 
> > tick_nohz_full_mask))
> > +   tick_nohz_full_kick_all();
> > +   raw_spin_unlock_irqrestore(>pi_lock, flags);
> > }
> > }
> >  }
> > 
> > 
> 
> Not long ago, Peterz suggested that we simply do:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index f0199a4ba1ad..42ce8e458013 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -357,17 +357,26 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);
>  void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
>  {
>   if (!atomic_fetch_or(BIT(bit), >tick_dep_mask)) {
> - if (tsk == current) {
> - preempt_disable();
> - tick_nohz_full_kick();
> - preempt_enable();
> - } else {
> - /*
> -  * Some future tick_nohz_full_kick_task()
> -  * should optimize this.
> -  */
> - tick_nohz_full_kick_all();
> - }
> + int cpu = task_cpu(tsk);
> +
> + /*
> +  * If the task concurrently migrates to another cpu,
> +  * we guarantee it sees the new tick dependency upon
> +  * schedule.
> +  *
> +  * set_task_cpu(p, cpu);
> +  *   STORE p->cpu = @cpu
> +  * __schedule() (switch to task 'p')
> +  *   LOCK rq->lock
> +  *   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
> +  *   tick_nohz_task_switch()smp_mb() 
> (atomic_fetch_or())
> +  *  LOAD p->tick_dep_mask   LOAD p->cpu
> +  */
> +
> + preempt_disable();
> + if (cpu_online(cpu))
> + tick_nohz_full_kick_cpu(cpu);
> + preempt_enable();
>   }
>  }
>  EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);

This can also be used for the signal case... thanks.



Re: Requirements to control kernel isolation/nohz_full at runtime

2020-09-03 Thread Marcelo Tosatti
On Thu, Sep 03, 2020 at 02:36:36PM -0400, Phil Auld wrote:
> On Thu, Sep 03, 2020 at 03:30:15PM -0300 Marcelo Tosatti wrote:
> > On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> > > On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > > > Hi,
> > > 
> > > Hi Frederic,
> > > 
> > > Thanks for the summary! Looking forward to your comments...
> > > 
> > > > I'm currently working on making nohz_full/nohz_idle runtime toggable
> > > > and some other people seem to be interested as well. So I've dumped
> > > > a few thoughts about some pre-requirements to achieve that for those
> > > > interested.
> > > > 
> > > > As you can see, there is a bit of hard work in the way. I'm iterating
> > > > that in https://pad.kernel.org/p/isolation, feel free to edit:
> > > > 
> > > > 
> > > > == RCU nocb ==
> > > > 
> > > > Currently controllable with "rcu_nocbs=" boot parameter and/or through 
> > > > nohz_full=/isolcpus=nohz
> > > > We need to make it toggeable at runtime. Currently handling that:
> > > > v1: https://lwn.net/Articles/820544/
> > > > v2: coming soon
> > > 
> > > Nice.
> > > 
> > > > == TIF_NOHZ ==
> > > > 
> > > > Need to get rid of that in order not to trigger syscall slowpath on 
> > > > CPUs that don't want nohz_full.
> > > > Also we don't want to iterate all threads and clear the flag when the 
> > > > last nohz_full CPU exits nohz_full
> > > > mode. Prefer static keys to call context tracking on archs. x86 does 
> > > > that well.
> > > > 
> > > > == Proper entry code ==
> > > > 
> > > > We must make sure that a given arch never calls exception_enter() / 
> > > > exception_exit().
> > > > This saves the previous state of context tracking and switch to kernel 
> > > > mode (from context tracking POV)
> > > > temporarily. Since this state is saved on the stack, this prevents us 
> > > > from turning off context tracking
> > > > entirely on a CPU: The tracking must be done on all CPUs and that takes 
> > > > some cycles.
> > > > 
> > > > This means that, considering early entry code (before the call to 
> > > > context tracking upon kernel entry,
> > > > and after the call to context tracking upon kernel exit), we must take 
> > > > care of few things:
> > > > 
> > > > 1) Make sure early entry code can't trigger exceptions. Or if it does, 
> > > > the given exception can't schedule
> > > > or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception 
> > > > must call exception_enter()/exception_exit()
> > > > which we don't want.
> > > > 
> > > > 2) No call to schedule_user().
> > > > 
> > > > 3) Make sure early entry code is not interruptible or 
> > > > preempt_schedule_irq() would rely on
> > > > exception_entry()/exception_exit()
> > > > 
> > > > 4) Make sure early entry code can't be traced (no call to 
> > > > preempt_schedule_notrace()), or if it does it
> > > > can't schedule
> > > > 
> > > > I believe x86 does most of that well. In the end we should remove 
> > > > exception_enter()/exit implementations
> > > > in x86 and replace it with a check that makes sure context_tracking 
> > > > state is not in USER. An arch meeting
> > > > all the above conditions would earn a 
> > > > CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. Being able to toggle nohz_full
> > > > at runtime would depend on that.
> > > > 
> > > > 
> > > > == Cputime accounting ==
> > > > 
> > > > Both write and read side must switch to tick based accounting and drop 
> > > > the use of seqlock in task_cputime(),
> > > > task_gtime(), kcpustat_field(), kcpustat_cpu_fetch(). Special 
> > > > ordering/state machine is required to make that without races.
> > > > 
> > > > == Nohz ==
> > > > 
> > > > Switch from nohz_full to nohz_idle. Mind a few details:
> > > > 
> > > > 1) Turn off 1Hz offlined tick handled in housekeeping
> > > > 2) Handle tick dependencies, take care of racing CPUs 
> > > > setting/clearing tick dependenc

Re: Requirements to control kernel isolation/nohz_full at runtime

2020-09-03 Thread Marcelo Tosatti
On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > Hi,
> 
> Hi Frederic,
> 
> Thanks for the summary! Looking forward to your comments...
> 
> > I'm currently working on making nohz_full/nohz_idle runtime toggable
> > and some other people seem to be interested as well. So I've dumped
> > a few thoughts about some pre-requirements to achieve that for those
> > interested.
> > 
> > As you can see, there is a bit of hard work in the way. I'm iterating
> > that in https://pad.kernel.org/p/isolation, feel free to edit:
> > 
> > 
> > == RCU nocb ==
> > 
> > Currently controllable with "rcu_nocbs=" boot parameter and/or through 
> > nohz_full=/isolcpus=nohz
> > We need to make it toggeable at runtime. Currently handling that:
> > v1: https://lwn.net/Articles/820544/
> > v2: coming soon
> 
> Nice.
> 
> > == TIF_NOHZ ==
> > 
> > Need to get rid of that in order not to trigger syscall slowpath on CPUs 
> > that don't want nohz_full.
> > Also we don't want to iterate all threads and clear the flag when the last 
> > nohz_full CPU exits nohz_full
> > mode. Prefer static keys to call context tracking on archs. x86 does that 
> > well.
> > 
> > == Proper entry code ==
> > 
> > We must make sure that a given arch never calls exception_enter() / 
> > exception_exit().
> > This saves the previous state of context tracking and switch to kernel mode 
> > (from context tracking POV)
> > temporarily. Since this state is saved on the stack, this prevents us from 
> > turning off context tracking
> > entirely on a CPU: The tracking must be done on all CPUs and that takes 
> > some cycles.
> > 
> > This means that, considering early entry code (before the call to context 
> > tracking upon kernel entry,
> > and after the call to context tracking upon kernel exit), we must take care 
> > of few things:
> > 
> > 1) Make sure early entry code can't trigger exceptions. Or if it does, the 
> > given exception can't schedule
> > or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception must 
> > call exception_enter()/exception_exit()
> > which we don't want.
> > 
> > 2) No call to schedule_user().
> > 
> > 3) Make sure early entry code is not interruptible or 
> > preempt_schedule_irq() would rely on
> > exception_entry()/exception_exit()
> > 
> > 4) Make sure early entry code can't be traced (no call to 
> > preempt_schedule_notrace()), or if it does it
> > can't schedule
> > 
> > I believe x86 does most of that well. In the end we should remove 
> > exception_enter()/exit implementations
> > in x86 and replace it with a check that makes sure context_tracking state 
> > is not in USER. An arch meeting
> > all the above conditions would earn a 
> > CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. Being able to toggle nohz_full
> > at runtime would depend on that.
> > 
> > 
> > == Cputime accounting ==
> > 
> > Both write and read side must switch to tick based accounting and drop the 
> > use of seqlock in task_cputime(),
> > task_gtime(), kcpustat_field(), kcpustat_cpu_fetch(). Special 
> > ordering/state machine is required to make that without races.
> > 
> > == Nohz ==
> > 
> > Switch from nohz_full to nohz_idle. Mind a few details:
> > 
> > 1) Turn off 1Hz offlined tick handled in housekeeping
> > 2) Handle tick dependencies, take care of racing CPUs setting/clearing 
> > tick dependency. It's much trickier when
> > we switch from nohz_idle to nohz_full
> > 
> > == Unbound affinity ==
> > 
> > Restore kernel threads, workqueue, timers, etc... wide affinity. But take 
> > care of cpumasks that have been set through other
> > interfaces: sysfs, procfs, etc...
> 
> We were looking at a userspace interface: what would be a proper
> (unified, similar to isolcpus= interface) and its implementation:
> 
> The simplest idea for interface seemed to be exposing the integer list of
> CPUs and isolation flags to userspace (probably via sysfs).
> 
> The scheme would allow flags to be separately enabled/disabled, 
> with not all flags being necessary toggable (could for example
> disallow nohz_full= toggling until it is implemented, but allow for
> other isolation features to be toggable).
> 
> This would require per flag housekeeping_masks (instead of a single).
> 
> Back to the userspace interface, you mentioned earlier that cpu

Re: Requirements to control kernel isolation/nohz_full at runtime

2020-09-03 Thread Marcelo Tosatti
On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> Hi,

Hi Frederic,

Thanks for the summary! Looking forward to your comments...

> I'm currently working on making nohz_full/nohz_idle runtime toggable
> and some other people seem to be interested as well. So I've dumped
> a few thoughts about some pre-requirements to achieve that for those
> interested.
> 
> As you can see, there is a bit of hard work in the way. I'm iterating
> that in https://pad.kernel.org/p/isolation, feel free to edit:
> 
> 
> == RCU nocb ==
> 
> Currently controllable with "rcu_nocbs=" boot parameter and/or through 
> nohz_full=/isolcpus=nohz
> We need to make it toggeable at runtime. Currently handling that:
> v1: https://lwn.net/Articles/820544/
> v2: coming soon

Nice.

> == TIF_NOHZ ==
> 
> Need to get rid of that in order not to trigger syscall slowpath on CPUs that 
> don't want nohz_full.
> Also we don't want to iterate all threads and clear the flag when the last 
> nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does that 
> well.
> 
> == Proper entry code ==
> 
> We must make sure that a given arch never calls exception_enter() / 
> exception_exit().
> This saves the previous state of context tracking and switch to kernel mode 
> (from context tracking POV)
> temporarily. Since this state is saved on the stack, this prevents us from 
> turning off context tracking
> entirely on a CPU: The tracking must be done on all CPUs and that takes some 
> cycles.
> 
> This means that, considering early entry code (before the call to context 
> tracking upon kernel entry,
> and after the call to context tracking upon kernel exit), we must take care 
> of few things:
> 
> 1) Make sure early entry code can't trigger exceptions. Or if it does, the 
> given exception can't schedule
> or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception must 
> call exception_enter()/exception_exit()
> which we don't want.
> 
> 2) No call to schedule_user().
> 
> 3) Make sure early entry code is not interruptible or preempt_schedule_irq() 
> would rely on
> exception_entry()/exception_exit()
> 
> 4) Make sure early entry code can't be traced (no call to 
> preempt_schedule_notrace()), or if it does it
> can't schedule
> 
> I believe x86 does most of that well. In the end we should remove 
> exception_enter()/exit implementations
> in x86 and replace it with a check that makes sure context_tracking state is 
> not in USER. An arch meeting
> all the above conditions would earn a CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. 
> Being able to toggle nohz_full
> at runtime would depend on that.
> 
> 
> == Cputime accounting ==
> 
> Both write and read side must switch to tick based accounting and drop the 
> use of seqlock in task_cputime(),
> task_gtime(), kcpustat_field(), kcpustat_cpu_fetch(). Special ordering/state 
> machine is required to make that without races.
> 
> == Nohz ==
> 
> Switch from nohz_full to nohz_idle. Mind a few details:
> 
> 1) Turn off 1Hz offlined tick handled in housekeeping
> 2) Handle tick dependencies, take care of racing CPUs setting/clearing 
> tick dependency. It's much trickier when
> we switch from nohz_idle to nohz_full
> 
> == Unbound affinity ==
> 
> Restore kernel threads, workqueue, timers, etc... wide affinity. But take 
> care of cpumasks that have been set through other
> interfaces: sysfs, procfs, etc...

We were looking at a userspace interface: what a proper one would be
(unified, similar to the isolcpus= interface) and how to implement it:

The simplest idea for an interface seemed to be exposing the integer list of
CPUs and isolation flags to userspace (probably via sysfs).

The scheme would allow flags to be separately enabled/disabled,
with not all flags necessarily being toggleable (for example, nohz_full=
toggling could be disallowed until it is implemented, while other
isolation features remain toggleable).

This would require per-flag housekeeping masks (instead of a single one).

Back to the userspace interface, you mentioned earlier that cpusets
was a possibility for it. However:

"Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extends these two mechanisms as follows:"

The isolation flags do not necessarily have anything to do with
tasks, but with CPUs: a given feature is disabled or enabled on a
given CPU. 
No?

---

Regarding locking of the masks: since housekeeping_cpumask() can be called
from hot paths (e.g. get_nohz_timer_target()), RCU seems a natural fit,
so userspace would (see the sketch below):

1) use interface to change cpumask for a given feature:

-> set_rcu_pointer
-> wait for grace period

2) proceed to trigger actions that rely on 
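
A rough sketch of the per-flag, RCU-protected housekeeping masks idea
described above. The names housekeeping_masks[], HK_FLAG_NR and
housekeeping_update_sketch() are illustrative, not an existing kernel API.

/* One RCU-protected cpumask per isolation flag (illustrative only). */
static struct cpumask __rcu *housekeeping_masks[HK_FLAG_NR];

/* Hot-path reader, e.g. get_nohz_timer_target(); caller holds rcu_read_lock(). */
static const struct cpumask *housekeeping_cpumask_sketch(int flag_idx)
{
	return rcu_dereference(housekeeping_masks[flag_idx]);
}

/* Writer, driven from the (hypothetical) sysfs interface. */
static int housekeeping_update_sketch(int flag_idx, const struct cpumask *new)
{
	struct cpumask *fresh, *old;

	fresh = kmalloc(cpumask_size(), GFP_KERNEL);
	if (!fresh)
		return -ENOMEM;
	cpumask_copy(fresh, new);

	old = rcu_dereference_protected(housekeeping_masks[flag_idx], 1);
	rcu_assign_pointer(housekeeping_masks[flag_idx], fresh);
	synchronize_rcu();	/* step 1: publish the new mask, wait for a grace period */
	kfree(old);

	/* step 2: only now trigger the actions that rely on the new mask */
	return 0;
}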

Re: [patch 1/2] nohz: try to avoid IPI when configuring per-CPU posix timer

2020-09-03 Thread Marcelo Tosatti
On Wed, Sep 02, 2020 at 01:38:59AM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2020 at 03:41:48PM -0300, Marcelo Tosatti wrote:
> > When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
> > performed (to re-read the dependencies and possibly not re-enter
> > nohz_full on a given CPU).
> > 
> > A common case is for applications that run on nohz_full= CPUs 
> > to not use POSIX timers (eg DPDK). This patch skips the IPI 
> > in case the task allowed mask does not intersect with nohz_full= CPU mask,
> > when going through tick_nohz_dep_set_signal.
> > 
> > This reduces interruptions to nohz_full= CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> [...]
> >  /*
> > + * Set bit on nohz full dependency, kicking all cpus
> > + * only if task can run on nohz full CPUs.
> > + */
> > +static void tick_nohz_dep_set_all_cond(struct task_struct *tsk,
> > +  atomic_t *dep,
> > +  enum tick_dep_bits bit)
> > +{
> > +   int prev;
> > +   unsigned long flags;
> > +
> > +   prev = atomic_fetch_or(BIT(bit), dep);
> > +   if (prev)
> > +   return;
> > +
> > +   raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> > +   if (cpumask_intersects(&tsk->cpus_mask, tick_nohz_full_mask))
> > +           tick_nohz_full_kick_all();
> 
> So that's for one task but what about the other threads in that
> process? We are setting the tick dependency on all tasks sharing that
> struct signal.

Hi Frederic,

Yep, fixing in -v2, thanks.
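
A sketch of the direction hinted at for v2 (not the posted patch): before
kicking all nohz_full CPUs for a process-wide dependency, check whether any
thread sharing the signal struct is allowed to run on a nohz_full CPU.
signal_intersects_nohz_full() is an illustrative name.

/*
 * Illustrative only: true if any thread of the group may run on a
 * nohz_full CPU. Caller must keep the thread list stable (RCU read
 * lock or tasklist_lock).
 */
static bool signal_intersects_nohz_full(struct task_struct *tsk)
{
	struct task_struct *t;

	for_each_thread(tsk, t) {
		if (cpumask_intersects(&t->cpus_mask, tick_nohz_full_mask))
			return true;
	}

	return false;
}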




[patch 1/2] nohz: try to avoid IPI when configuring per-CPU posix timer

2020-08-25 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs 
to not use POSIX timers (e.g. DPDK). This patch skips the IPI
in case the task's allowed CPU mask does not intersect with the nohz_full=
CPU mask when going through tick_nohz_dep_set_signal().

This reduces interruptions to nohz_full= CPUs.

Signed-off-by: Marcelo Tosatti 

---
 include/linux/tick.h   |   11 +++
 kernel/time/posix-cpu-timers.c |4 ++--
 kernel/time/tick-sched.c   |   27 +--
 3 files changed, 34 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/tick.h
===
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -207,7 +207,8 @@ extern void tick_nohz_dep_set_task(struc
   enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
 enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
+struct signal_struct *signal,
 enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
   enum tick_dep_bits bit);
@@ -252,11 +253,12 @@ static inline void tick_dep_clear_task(s
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
 }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
+  struct signal_struct *signal,
   enum tick_dep_bits bit)
 {
if (tick_nohz_full_enabled())
-   tick_nohz_dep_set_signal(signal, bit);
+   tick_nohz_dep_set_signal(tsk, signal, bit);
 }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit)
@@ -284,7 +286,8 @@ static inline void tick_dep_set_task(str
 enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_task(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
+  struct signal_struct *signal,
   enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit) { }
Index: linux-2.6/kernel/time/posix-cpu-timers.c
===
--- linux-2.6.orig/kernel/time/posix-cpu-timers.c
+++ linux-2.6/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *t
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
-   tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(p, p->signal, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 /*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_s
if (*newval < *nextevt)
*nextevt = *newval;
 
-   tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(tsk, tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -302,6 +302,27 @@ static void tick_nohz_dep_set_all(atomic
 }
 
 /*
+ * Set bit on nohz full dependency, kicking all cpus
+ * only if task can run on nohz full CPUs.
+ */
+static void tick_nohz_dep_set_all_cond(struct task_struct *tsk,
+  atomic_t *dep,
+  enum tick_dep_bits bit)
+{
+   int prev;
+   unsigned long flags;
+
+   prev = atomic_fetch_or(BIT(bit), dep);
+   if (prev)
+   return;
+
+   raw_spin_lock_irqsave(&tsk->pi_lock, flags);
+   if (cpumask_intersects(&tsk->cpus_mask, tick_nohz_full_mask))
+           tick_nohz_full_kick_all();
+   raw_spin_unlock_irqrestore(&tsk->pi_lock, flags);
+}
+
+/*
  * Set a global tick dependency. Used by perf events that rely on freq and
  * by unstable clock.
  */
@@ -382,9 +403,11 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  * Set a per-taskgroup tick dependency. Posix CPU timers need this 

[patch 2/2] nohz: try to avoid IPI when setting tick dependency for task

2020-08-25 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (eg DPDK).

This patch optimizes tick_nohz_dep_set_task() to avoid kicking
all nohz_full= CPUs in case the task's allowed CPU mask does not
intersect with the nohz_full= CPU mask.

This reduces interruptions to nohz_full= CPUs.

---
 kernel/time/tick-sched.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -383,11 +383,16 @@ void tick_nohz_dep_set_task(struct task_
tick_nohz_full_kick();
preempt_enable();
} else {
+   unsigned long flags;
+
/*
 * Some future tick_nohz_full_kick_task()
-* should optimize this.
+* should further optimize this.
 */
-   tick_nohz_full_kick_all();
+   raw_spin_lock_irqsave(&tsk->pi_lock, flags);
+   if (cpumask_intersects(&tsk->cpus_mask, tick_nohz_full_mask))
+           tick_nohz_full_kick_all();
+   raw_spin_unlock_irqrestore(&tsk->pi_lock, flags);
}
}
 }




[patch 0/2] posix-timers: avoid nohz_full= IPIs via task cpu masks

2020-08-25 Thread Marcelo Tosatti
This patchset avoids IPIs to nohz_full= CPUs when the intersection
between the set of nohz_full= CPUs and the task's allowed CPUs is empty.

See individual patches for details.




Re: [PATCH v1 0/3] Preventing job distribution to isolated CPUs

2020-06-16 Thread Marcelo Tosatti
Hi Nitesh,

On Wed, Jun 10, 2020 at 12:12:23PM -0400, Nitesh Narayan Lal wrote:
> This patch-set is originated from one of the patches that have been
> posted earlier as a part of "Task_isolation" mode [1] patch series
> by Alex Belits . There are only a couple of
> changes that I am proposing in this patch-set compared to what Alex
> has posted earlier.
> 
> 
> Context
> ===
> On a broad level, all three patches that are included in this patch
> set are meant to improve the driver/library to respect isolated
> CPUs by not pinning any job on it. Not doing so could impact
> the latency values in RT use-cases.
> 
> 
> Patches
> ===
> * Patch1:
>   The first patch is meant to make cpumask_local_spread()
>   aware of the isolated CPUs. It ensures that the CPUs that
>   are returned by this API only includes housekeeping CPUs.
> 
> * Patch2:
>   This patch ensures that a probe function that is called
>   using work_on_cpu() doesn't run any task on an isolated CPU.
> 
> * Patch3:
>   This patch makes store_rps_map() aware of the isolated
>   CPUs so that rps don't queue any jobs on an isolated CPU.
> 
> 
> Changes
> ===
> To fix the above-mentioned issues Alex has used housekeeping_cpumask().
> The only changes that I am proposing here are:
> - Removing the dependency on CONFIG_TASK_ISOLATION that was proposed by Alex.
>   As it should be safe to rely on housekeeping_cpumask()
>   even when we don't have any isolated CPUs and we want
>   to fall back to using all available CPUs in any of the above scenarios.
> - Using both HK_FLAG_DOMAIN and HK_FLAG_WQ in all three patches, this is
>   because we would want the above fixes not only when we have isolcpus but
>   also with something like systemd's CPU affinity.
> 
> 
> Testing
> ===
> * Patch 1:
>   Fix for cpumask_local_spread() is tested by creating VFs, loading
>   iavf module and by adding a tracepoint to confirm that only housekeeping
>   CPUs are picked when an appropriate profile is set up and all remaining CPUs
>   when no CPU isolation is required/configured.
> 
> * Patch 2:
>   To test the PCI fix, I hotplugged a virtio-net-pci from qemu console
>   and forced its addition to a specific node to trigger the code path that
>   includes the proposed fix and verified that only housekeeping CPUs
>   are included via tracepoint. I understand that this may not be the
>   best way to test it, hence, I am open to any suggestion to test this
>   fix in a better way if required.
> 
> * Patch 3:
>   To test the fix in store_rps_map(), I tried configuring an isolated
>   CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus which
>   resulted in 'write error: Invalid argument' error. For the case
>   where a non-isolated CPU is writing in rps_cpus the above operation
>   succeeded without any error.
> 
> [1] 
> https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/
> 
> Alex Belits (3):
>   lib: restricting cpumask_local_spread to only houskeeping CPUs
>   PCI: prevent work_on_cpu's probe to execute on isolated CPUs
>   net: restrict queuing of receive packets to housekeeping CPUs
> 
>  drivers/pci/pci-driver.c |  5 -
>  lib/cpumask.c| 43 +++-
>  net/core/net-sysfs.c | 10 +-
>  3 files changed, 38 insertions(+), 20 deletions(-)
> 
> --  
> 

Looks good to me.

The flags mechanism is not well organized: this uses HK_FLAG_WQ to
infer that nohz_full= is set (while HK_FLAG_WQ should only indicate that
non-affined workqueue threads must not run on certain CPUs).

But this is a problem of the flags (which apparently Frederic wants
to fix by exposing a limited number of options to users), and not
of this patch.
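
For reference, a simplified sketch of the idea behind Patch1 of the series
(not the exact patch under review): restrict the CPUs a
cpumask_local_spread()-style helper may return to the housekeeping set, so
jobs are never pinned to isolated CPUs. The HK_FLAG_DOMAIN | HK_FLAG_WQ
combination mirrors the flags mentioned above; NUMA locality handling is
omitted and at least one housekeeping CPU is assumed.

static unsigned int local_spread_housekeeping(unsigned int i)
{
	const struct cpumask *hk_mask;
	int cpu;

	hk_mask = housekeeping_cpumask(HK_FLAG_DOMAIN | HK_FLAG_WQ);

	/* Wrap the index around the number of housekeeping CPUs. */
	i %= cpumask_weight(hk_mask);

	for_each_cpu(cpu, hk_mask) {
		if (i-- == 0)
			return cpu;
	}

	return cpumask_first(hk_mask);
}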




[tip: sched/core] kthread: Switch to cpu_possible_mask

2020-06-16 Thread tip-bot2 for Marcelo Tosatti
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 043eb8e1051143a24811e6f35c276e35ae8247b6
Gitweb:
https://git.kernel.org/tip/043eb8e1051143a24811e6f35c276e35ae8247b6
Author:Marcelo Tosatti 
AuthorDate:Wed, 27 May 2020 16:29:08 +02:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 15 Jun 2020 14:10:03 +02:00

kthread: Switch to cpu_possible_mask

Next patch will switch unbound kernel threads mask to
housekeeping_cpumask(), a subset of cpu_possible_mask. So in order to
ease bisection, let's first switch kthreads' default affinity from
cpu_all_mask to cpu_possible_mask.

It looks safe to do so as cpu_possible_mask seems to be initialized
at setup_arch() time, way before kthreadd is created.

Suggested-by: Frederic Weisbecker 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200527142909.23372-2-frede...@kernel.org
---
 kernel/kthread.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8e3d2d7..b86d37c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -383,7 +383,7 @@ struct task_struct *__kthread_create_on_node(int 
(*threadfn)(void *data),
 * The kernel thread should not inherit these properties.
 */
	sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
-   set_cpus_allowed_ptr(task, cpu_all_mask);
+   set_cpus_allowed_ptr(task, cpu_possible_mask);
}
kfree(create);
return task;
@@ -608,7 +608,7 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
-   set_cpus_allowed_ptr(tsk, cpu_all_mask);
+   set_cpus_allowed_ptr(tsk, cpu_possible_mask);
set_mems_allowed(node_states[N_MEMORY]);
 
current->flags |= PF_NOFREEZE;


[tip: sched/core] isolcpus: Affine unbound kernel threads to housekeeping cpus

2020-06-16 Thread tip-bot2 for Marcelo Tosatti
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9cc5b8656892a72438ee7deb5e80f5be47643b8b
Gitweb:
https://git.kernel.org/tip/9cc5b8656892a72438ee7deb5e80f5be47643b8b
Author:Marcelo Tosatti 
AuthorDate:Wed, 27 May 2020 16:29:09 +02:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 15 Jun 2020 14:10:03 +02:00

isolcpus: Affine unbound kernel threads to housekeeping cpus

This is a kernel enhancement that configures the cpu affinity of kernel
threads via kernel boot option nohz_full=.

When this option is specified, the cpumask is immediately applied upon
kthread launch. This does not affect kernel threads that specify cpu
and node.

This allows CPU isolation (that is not allowing certain threads
to execute on certain CPUs) without using the isolcpus=domain parameter,
making it possible to enable load balancing on such CPUs
during runtime (see kernel-parameters.txt).

Note-1: this is based on Wind River's patch at
https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

Difference being that this patch is limited to modifying kernel thread
cpumask. Behaviour of other threads can be controlled via cgroups or
sched_setaffinity.

Note-2: Wind River's patch was based off Christoph Lameter's patch at
https://lwn.net/Articles/565932/ with the only difference being
the kernel parameter changed from kthread to kthread_cpus.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200527142909.23372-3-frede...@kernel.org
---
 include/linux/sched/isolation.h | 1 +
 kernel/kthread.c| 6 --
 kernel/sched/isolation.c| 3 ++-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 0fbcbac..cc9f393 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -14,6 +14,7 @@ enum hk_flags {
HK_FLAG_DOMAIN  = (1 << 5),
HK_FLAG_WQ  = (1 << 6),
HK_FLAG_MANAGED_IRQ = (1 << 7),
+   HK_FLAG_KTHREAD = (1 << 8),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/kthread.c b/kernel/kthread.c
index b86d37c..032b610 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -383,7 +384,8 @@ struct task_struct *__kthread_create_on_node(int 
(*threadfn)(void *data),
 * The kernel thread should not inherit these properties.
 */
	sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
-   set_cpus_allowed_ptr(task, cpu_possible_mask);
+   set_cpus_allowed_ptr(task,
+housekeeping_cpumask(HK_FLAG_KTHREAD));
}
kfree(create);
return task;
@@ -608,7 +610,7 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
-   set_cpus_allowed_ptr(tsk, cpu_possible_mask);
+   set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_FLAG_KTHREAD));
set_mems_allowed(node_states[N_MEMORY]);
 
current->flags |= PF_NOFREEZE;
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 808244f..5a6ea03 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -140,7 +140,8 @@ static int __init housekeeping_nohz_full_setup(char *str)
 {
unsigned int flags;
 
-   flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | 
HK_FLAG_MISC;
+   flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
+   HK_FLAG_MISC | HK_FLAG_KTHREAD;
 
return housekeeping_setup(str, flags);
 }


Re: [PATCH v5 2/7] fpga: dfl: pci: add irq info for feature devices enumeration

2020-05-25 Thread Marcelo Tosatti
cdev)) {
>   dev_err(&pcidev->dev, "Enumeration failure\n");
>   ret = PTR_ERR(cdev);
> - goto enum_info_free_exit;
> + goto irq_free_exit;
>   }
>  
>   drvdata->cdev = cdev;
>  
> +irq_free_exit:
> + if (ret)
> + cci_pci_free_irq(pcidev);
>  enum_info_free_exit:
>   dfl_fpga_enum_info_free(info);
>  
> @@ -211,12 +275,10 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct 
> pci_device_id *pcidevid)
>   }
>  
>   ret = cci_enumerate_feature_devs(pcidev);
> - if (ret) {
> - dev_err(&pcidev->dev, "enumeration failure %d.\n", ret);
> - goto disable_error_report_exit;
> - }
> + if (!ret)
> + return ret;
>  
> - return ret;
> + dev_err(&pcidev->dev, "enumeration failure %d.\n", ret);
>  
>  disable_error_report_exit:
>   pci_disable_pcie_error_reporting(pcidev);
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 



Re: [PATCH v5 4/7] fpga: dfl: afu: add interrupt support for port error reporting

2020-05-25 Thread Marcelo Tosatti
 dfl_fpga_irq_set)
> + *
> + * Set fpga port error reporting interrupt trigger if evtfds[n] is valid.
> + * Unset related interrupt trigger if evtfds[n] is a negative value.
> + * Return: 0 on success, -errno on failure.
> + */
> +#define DFL_FPGA_PORT_ERR_SET_IRQ_IOW(DFL_FPGA_MAGIC,\
> +  DFL_PORT_BASE + 6, \
> +  struct dfl_fpga_irq_set)
> +
>  /* IOCTLs for FME file descriptor */
>  
>  /**
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 



Re: [PATCH v5 6/7] fpga: dfl: afu: add AFU interrupt support

2020-05-25 Thread Marcelo Tosatti
On Mon, Apr 20, 2020 at 04:11:42PM +0800, Xu Yilun wrote:
> AFU (Accelerated Function Unit) is dynamic region of the DFL based FPGA,
> and always defined by users. Some DFL based FPGA cards allow users to
> implement their own interrupts in AFU. In order to support this,
> hardware implements a new UINT (AFU Interrupt) private feature with
> related capability register which describes the number of supported
> AFU interrupts as well as the local index of the interrupts for
> software enumeration, and from software side, driver follows the common
> DFL interrupt notification and handling mechanism, and it implements
> two ioctls below for user to query number of irqs supported and set/unset
> interrupt triggers.
> 
>  Ioctls:
>  * DFL_FPGA_PORT_UINT_GET_IRQ_NUM
>get the number of irqs, which is used to determine how many interrupts
>UINT feature supports.
> 
>  * DFL_FPGA_PORT_UINT_SET_IRQ
>set/unset eventfds as AFU interrupt triggers.
> 
> Signed-off-by: Luwei Kang 
> Signed-off-by: Wu Hao 
> Signed-off-by: Xu Yilun 
> Acked-by: Wu Hao 
> 
> v2: use DFL_FPGA_PORT_UINT_GET_IRQ_NUM instead of
> DFL_FPGA_PORT_UINT_GET_INFO
> Delete flags field for DFL_FPGA_PORT_UINT_SET_IRQ
> v3: put_user() instead of copy_to_user()
> improves comments
> v4: use common functions to handle irq ioctls
> v5: Minor fixes for Hao's comments
> ---
>  drivers/fpga/dfl-afu-main.c   | 28 
>  include/uapi/linux/fpga-dfl.h | 23 +++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index b1ed7b4..753cda4 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -530,6 +530,30 @@ static const struct dfl_feature_ops port_stp_ops = {
>   .init = port_stp_init,
>  };
>  
> +static long
> +port_uint_ioctl(struct platform_device *pdev, struct dfl_feature *feature,
> + unsigned int cmd, unsigned long arg)
> +{
> + switch (cmd) {
> + case DFL_FPGA_PORT_UINT_GET_IRQ_NUM:
> + return dfl_feature_ioctl_get_num_irqs(pdev, feature, arg);
> + case DFL_FPGA_PORT_UINT_SET_IRQ:
> + return dfl_feature_ioctl_set_irq(pdev, feature, arg);
> + default:
> + dev_dbg(&pdev->dev, "%x cmd not handled", cmd);
> + return -ENODEV;
> + }
> +}
> +
> +static const struct dfl_feature_id port_uint_id_table[] = {
> + {.id = PORT_FEATURE_ID_UINT,},
> + {0,}
> +};
> +
> +static const struct dfl_feature_ops port_uint_ops = {
> + .ioctl = port_uint_ioctl,
> +};
> +
>  static struct dfl_feature_driver port_feature_drvs[] = {
>   {
>   .id_table = port_hdr_id_table,
> @@ -548,6 +572,10 @@ static struct dfl_feature_driver port_feature_drvs[] = {
>   .ops = &port_stp_ops,
>   },
>   {
> + .id_table = port_uint_id_table,
> + .ops = &port_uint_ops,
> + },
> + {
>   .ops = NULL,
>   }
>  };
> diff --git a/include/uapi/linux/fpga-dfl.h b/include/uapi/linux/fpga-dfl.h
> index b6495ea..1621b07 100644
> --- a/include/uapi/linux/fpga-dfl.h
> +++ b/include/uapi/linux/fpga-dfl.h
> @@ -187,6 +187,29 @@ struct dfl_fpga_irq_set {
>DFL_PORT_BASE + 6, \
>struct dfl_fpga_irq_set)
>  
> +/**
> + * DFL_FPGA_PORT_UINT_GET_IRQ_NUM - _IOR(DFL_FPGA_MAGIC, DFL_PORT_BASE + 7,
> + *   __u32 num_irqs)
> + *
> + * Get the number of irqs supported by the fpga AFU interrupt private
> + * feature.
> + * Return: 0 on success, -errno on failure.
> + */
> +#define DFL_FPGA_PORT_UINT_GET_IRQ_NUM   _IOR(DFL_FPGA_MAGIC,\
> +  DFL_PORT_BASE + 7, __u32)
> +
> +/**
> + * DFL_FPGA_PORT_UINT_SET_IRQ - _IOW(DFL_FPGA_MAGIC, DFL_PORT_BASE + 8,
> + *   struct dfl_fpga_irq_set)
> + *
> + * Set fpga AFU interrupt trigger if evtfds[n] is valid.
> + * Unset related interrupt trigger if evtfds[n] is a negative value.
> + * Return: 0 on success, -errno on failure.
> + */
> +#define DFL_FPGA_PORT_UINT_SET_IRQ   _IOW(DFL_FPGA_MAGIC,\
> +  DFL_PORT_BASE + 8, \
> +  struct dfl_fpga_irq_set)
> +
>  /* IOCTLs for FME file descriptor */
>  
>  /**
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 



Re: [PATCH v5 7/7] Documentation: fpga: dfl: add descriptions for interrupt related interfaces.

2020-05-25 Thread Marcelo Tosatti
On Mon, Apr 20, 2020 at 04:11:43PM +0800, Xu Yilun wrote:
> This patch adds introductions of interrupt related interfaces for FME
> error reporting, port error reporting and AFU user interrupts features.
> 
> Signed-off-by: Luwei Kang 
> Signed-off-by: Wu Hao 
> Signed-off-by: Xu Yilun 
> Acked-by: Wu Hao 
> 
> v2: Update Documents cause change of irq ioctl interfaces.
> v3: No change
> v4: Update interrupt support part.
> v5: No change
> ---
>  Documentation/fpga/dfl.rst | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
> index 094fc8a..702bf62 100644
> --- a/Documentation/fpga/dfl.rst
> +++ b/Documentation/fpga/dfl.rst
> @@ -89,6 +89,8 @@ The following functions are exposed through ioctls:
>  - Program bitstream (DFL_FPGA_FME_PORT_PR)
>  - Assign port to PF (DFL_FPGA_FME_PORT_ASSIGN)
>  - Release port from PF (DFL_FPGA_FME_PORT_RELEASE)
> +- Get number of irqs of FME global error (DFL_FPGA_FME_ERR_GET_IRQ_NUM)
> +- Set interrupt trigger for FME error (DFL_FPGA_FME_ERR_SET_IRQ)
>  
>  More functions are exposed through sysfs
>  (/sys/class/fpga_region/regionX/dfl-fme.n/):
> @@ -144,6 +146,10 @@ The following functions are exposed through ioctls:
>  - Map DMA buffer (DFL_FPGA_PORT_DMA_MAP)
>  - Unmap DMA buffer (DFL_FPGA_PORT_DMA_UNMAP)
>  - Reset AFU (DFL_FPGA_PORT_RESET)
> +- Get number of irqs of port error (DFL_FPGA_PORT_ERR_GET_IRQ_NUM)
> +- Set interrupt trigger for port error (DFL_FPGA_PORT_ERR_SET_IRQ)
> +- Get number of irqs of UINT (DFL_FPGA_PORT_UINT_GET_IRQ_NUM)
> +- Set interrupt trigger for UINT (DFL_FPGA_PORT_UINT_SET_IRQ)
>  
>  DFL_FPGA_PORT_RESET:
>reset the FPGA Port and its AFU. Userspace can do Port
> @@ -378,6 +384,19 @@ The device nodes used for ioctl() or mmap() can be 
> referenced through::
>   /sys/class/fpga_region///dev
>  
>  
> +Interrupt support
> +=
> +Some FME and AFU private features are able to generate interrupts. As 
> mentioned
> +above, users could call ioctl (DFL_FPGA_*_GET_IRQ_NUM) to know whether or how
> +many interrupts are supported for this private feature. Drivers also 
> implement
> +an eventfd based interrupt handling mechanism for users to get notified when
> +interrupt happens. Users could set eventfds to driver via
> +ioctl (DFL_FPGA_*_SET_IRQ), and then poll/select on these eventfds waiting 
> for
> +notification.
> +In Current DFL, 3 sub features (Port error, FME global error and AFU 
> interrupt)
> +support interrupts.
> +
> +
>  Add new FIUs support
>  
>  It's possible that developers made some new function blocks (FIUs) under this
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 
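
As a usage illustration of the interrupt support documented above, a minimal
userspace sketch that arms one eventfd as the port error interrupt trigger
via DFL_FPGA_PORT_ERR_SET_IRQ and waits for a notification with poll(). The
/dev/dfl-port.0 path is illustrative and error handling is trimmed.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/fpga-dfl.h>

int main(void)
{
	int port = open("/dev/dfl-port.0", O_RDWR);	/* illustrative path */
	int efd = eventfd(0, EFD_CLOEXEC);
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	struct dfl_fpga_irq_set *irq_set;
	uint64_t count;

	/* One trigger, starting at local irq index 0. */
	irq_set = calloc(1, sizeof(*irq_set) + sizeof(__s32));
	irq_set->start = 0;
	irq_set->count = 1;
	irq_set->evtfds[0] = efd;

	if (ioctl(port, DFL_FPGA_PORT_ERR_SET_IRQ, irq_set))
		perror("DFL_FPGA_PORT_ERR_SET_IRQ");

	/* Block until the driver signals a port error interrupt. */
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		read(efd, &count, sizeof(count));

	free(irq_set);
	close(efd);
	close(port);
	return 0;
}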



Re: [PATCH v5 1/7] fpga: dfl: parse interrupt info for feature devices on enumeration

2020-05-25 Thread Marcelo Tosatti
  devm_kfree(dev, info);
>   put_device(dev);
>  }
> @@ -892,6 +1003,45 @@ int dfl_fpga_enum_info_add_dfl(struct 
> dfl_fpga_enum_info *info,
>  }
>  EXPORT_SYMBOL_GPL(dfl_fpga_enum_info_add_dfl);
>  
> +/**
> + * dfl_fpga_enum_info_add_irq - add irq table to enum info
> + *
> + * @info: ptr to dfl_fpga_enum_info
> + * @nr_irqs: number of irqs of the DFL fpga device to be enumerated.
> + * @irq_table: Linux IRQ numbers for all irqs, indexed by local irq index of
> + *  this device.
> + *
> + * One FPGA device may have several interrupts. This function adds irq
> + * information of the DFL fpga device to enum info for next step enumeration.
> + * This function should be called before dfl_fpga_feature_devs_enumerate().
> + * As we only support one irq domain for all DFLs in the same enum info, 
> adding
> + * irq table a second time for the same enum info will return error.
> + *
> + * If we need to enumerate DFLs which belong to different irq domains, we
> + * should fill more enum info and enumerate them one by one.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +int dfl_fpga_enum_info_add_irq(struct dfl_fpga_enum_info *info,
> +unsigned int nr_irqs, int *irq_table)
> +{
> + if (!nr_irqs || !irq_table)
> + return -EINVAL;
> +
> + if (info->irq_table)
> + return -EEXIST;
> +
> + info->irq_table = devm_kmemdup(info->dev, irq_table,
> +sizeof(int) * nr_irqs, GFP_KERNEL);
> + if (!info->irq_table)
> + return -ENOMEM;
> +
> + info->nr_irqs = nr_irqs;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dfl_fpga_enum_info_add_irq);
> +
>  static int remove_feature_dev(struct device *dev, void *data)
>  {
>   struct platform_device *pdev = to_platform_device(dev);
> @@ -959,6 +1109,10 @@ dfl_fpga_feature_devs_enumerate(struct 
> dfl_fpga_enum_info *info)
>   binfo->dev = info->dev;
>   binfo->cdev = cdev;
>  
> + binfo->nr_irqs = info->nr_irqs;
> + if (info->nr_irqs)
> + binfo->irq_table = info->irq_table;
> +
>   /*
>* start enumeration for all feature devices based on Device Feature
>* Lists.
> diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> index 74784d3..4bc165f 100644
> --- a/drivers/fpga/dfl.h
> +++ b/drivers/fpga/dfl.h
> @@ -112,6 +112,13 @@
>  #define FME_PORT_OFST_ACC_VF 1
>  #define FME_PORT_OFST_IMPBIT_ULL(60)
>  
> +/* FME Error Capability Register */
> +#define FME_ERROR_CAP0x70
> +
> +/* FME Error Capability Register Bitfield */
> +#define FME_ERROR_CAP_SUPP_INT   BIT_ULL(0)  /* Interrupt 
> Support */
> +#define FME_ERROR_CAP_INT_VECT   GENMASK_ULL(12, 1)  /* Interrupt 
> vector */
> +
>  /* PORT Header Register Set */
>  #define PORT_HDR_DFH DFH
>  #define PORT_HDR_GUID_L  GUID_L
> @@ -145,6 +152,20 @@
>  #define PORT_STS_PWR_STATE_AP2   2   /* 90% 
> throttling */
>  #define PORT_STS_PWR_STATE_AP6   6   /* 100% 
> throttling */
>  
> +/* Port Error Capability Register */
> +#define PORT_ERROR_CAP   0x38
> +
> +/* Port Error Capability Register Bitfield */
> +#define PORT_ERROR_CAP_SUPP_INT  BIT_ULL(0)  /* Interrupt 
> Support */
> +#define PORT_ERROR_CAP_INT_VECT  GENMASK_ULL(12, 1)  /* Interrupt 
> vector */
> +
> +/* Port Uint Capability Register */
> +#define PORT_UINT_CAP0x8
> +
> +/* Port Uint Capability Register Bitfield */
> +#define PORT_UINT_CAP_INT_NUMGENMASK_ULL(11, 0)  /* Interrupts 
> num */
> +#define PORT_UINT_CAP_FST_VECT   GENMASK_ULL(23, 12) /* First Vector 
> */
> +
>  /**
>   * struct dfl_fpga_port_ops - port ops
>   *
> @@ -189,6 +210,15 @@ struct dfl_feature_driver {
>  };
>  
>  /**
> + * struct dfl_feature_irq_ctx - dfl private feature interrupt context
> + *
> + * @irq: Linux IRQ number of this interrupt.
> + */
> +struct dfl_feature_irq_ctx {
> + int irq;
> +};
> +
> +/**
>   * struct dfl_feature - sub feature of the feature devices
>   *
>   * @id: sub feature id.
> @@ -196,12 +226,16 @@ struct dfl_feature_driver {
>   *   this index is used to find its mmio resource from the
>   *   feature dev (platform device)'s reources.
>   * @ioaddr: mapped mmio resource address.
> + * @irq_ctx: interrupt context list.
> + * @nr_irqs: number of interrupt contexts.
>   * @ops: ops of this sub feature.
>   */
>  struct dfl_feature {
>   u64 id;
>   int resource_index;
>   void __iomem *ioaddr;
> + struct dfl_feature_irq_ctx *irq_ctx;
> + unsigned int nr_irqs;
>   const struct dfl_feature_ops *ops;
>  };
>  
> @@ -388,10 +422,14 @@ static inline u8 dfl_feature_revision(void __iomem 
> *base)
>   *
>   * @dev: parent device.
>   * @dfls: list of device feature lists.
> + * @nr_irqs: number of irqs for all feature devices.
> + * @irq_table: Linux IRQ numbers for all irqs, indexed by hw irq numbers.
>   */
>  struct dfl_fpga_enum_info {
>   struct device *dev;
>   struct list_head dfls;
> + unsigned int nr_irqs;
> + int *irq_table;
>  };
>  
>  /**
> @@ -415,6 +453,8 @@ struct dfl_fpga_enum_info 
> *dfl_fpga_enum_info_alloc(struct device *dev);
>  int dfl_fpga_enum_info_add_dfl(struct dfl_fpga_enum_info *info,
>  resource_size_t start, resource_size_t len,
>  void __iomem *ioaddr);
> +int dfl_fpga_enum_info_add_irq(struct dfl_fpga_enum_info *info,
> +unsigned int nr_irqs, int *irq_table);
>  void dfl_fpga_enum_info_free(struct dfl_fpga_enum_info *info);
>  
>  /**
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 



Re: [PATCH v5 3/7] fpga: dfl: introduce interrupt trigger setting API

2020-05-25 Thread Marcelo Tosatti
ig_ports_pf(struct dfl_fpga_cdev *cdev);
>  int dfl_fpga_cdev_config_ports_vf(struct dfl_fpga_cdev *cdev, int num_vf);
> +int dfl_fpga_set_irq_triggers(struct dfl_feature *feature, unsigned int 
> start,
> +   unsigned int count, int32_t *fds);
> +long dfl_feature_ioctl_get_num_irqs(struct platform_device *pdev,
> + struct dfl_feature *feature,
> +         unsigned long arg);
> +long dfl_feature_ioctl_set_irq(struct platform_device *pdev,
> +struct dfl_feature *feature,
> +unsigned long arg);
> +
>  #endif /* __FPGA_DFL_H */
> diff --git a/include/uapi/linux/fpga-dfl.h b/include/uapi/linux/fpga-dfl.h
> index ec70a0746..7331350 100644
> --- a/include/uapi/linux/fpga-dfl.h
> +++ b/include/uapi/linux/fpga-dfl.h
> @@ -151,6 +151,19 @@ struct dfl_fpga_port_dma_unmap {
>  
>  #define DFL_FPGA_PORT_DMA_UNMAP  _IO(DFL_FPGA_MAGIC, 
> DFL_PORT_BASE + 4)
>  
> +/**
> + * struct dfl_fpga_irq_set - the argument for DFL_FPGA_XXX_SET_IRQ ioctl.
> + *
> + * @start: Index of the first irq.
> + * @count: The number of eventfd handler.
> + * @evtfds: Eventfd handlers.
> + */
> +struct dfl_fpga_irq_set {
> + __u32 start;
> + __u32 count;
> + __s32 evtfds[];
> +};
> +
>  /* IOCTLs for FME file descriptor */
>  
>  /**
> -- 
> 2.7.4


Reviewed-by: Marcelo Tosatti 



Re: [PATCH v5 5/7] fpga: dfl: fme: add interrupt support for global error reporting

2020-05-25 Thread Marcelo Tosatti
_FME_BASE + 4,
> + *   struct dfl_fpga_irq_set)
> + *
> + * Set fpga fme error reporting interrupt trigger if evtfds[n] is valid.
> + * Unset related interrupt trigger if evtfds[n] is a negative value.
> + * Return: 0 on success, -errno on failure.
> + */
> +#define DFL_FPGA_FME_ERR_SET_IRQ _IOW(DFL_FPGA_MAGIC,\
> +  DFL_FME_BASE + 4,  \
> +  struct dfl_fpga_irq_set)
> +
>  #endif /* _UAPI_LINUX_FPGA_DFL_H */
> -- 
> 2.7.4

Reviewed-by: Marcelo Tosatti 



Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel

2020-04-28 Thread Marcelo Tosatti


I like the idea as well, especially the reporting infrastructure, and 
would like to see something like this integrated upstream.

On Thu, Mar 05, 2020 at 07:33:13PM +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:07:12PM +, Alex Belits wrote:
> > The existing nohz_full mode is designed as a "soft" isolation mode
> > that makes tradeoffs to minimize userspace interruptions while
> > still attempting to avoid overheads in the kernel entry/exit path,
> > to provide 100% kernel semantics, etc.
> > 
> > However, some applications require a "hard" commitment from the
> > kernel to avoid interruptions, in particular userspace device driver
> > style applications, such as high-speed networking code.
> > 
> > This change introduces a framework to allow applications
> > to elect to have the "hard" semantics as needed, specifying
> > prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> > 
> > The kernel must be built with the new TASK_ISOLATION Kconfig flag
> > to enable this mode, and the kernel booted with an appropriate
> > "isolcpus=nohz,domain,CPULIST" boot argument to enable
> > nohz_full and isolcpus. The "task_isolation" state is then indicated
> > by setting a new task struct field, task_isolation_flag, to the
> > value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> > bit in the thread_info flags. When the kernel is returning to
> > userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> > it calls the new task_isolation_start() routine to arrange for
> > the task to avoid being interrupted in the future.
> > 
> > With interrupts disabled, task_isolation_start() ensures that kernel
> > subsystems that might cause a future interrupt are quiesced. If it
> > doesn't succeed, it adjusts the syscall return value to indicate that
> > fact, and userspace can retry as desired. In addition to stopping
> > the scheduler tick, the code takes any actions that might avoid
> > a future interrupt to the core, such as a worker thread being
> > scheduled that could be quiesced now (e.g. the vmstat worker)
> > or a future IPI to the core to clean up some state that could be
> > cleaned up now (e.g. the mm lru per-cpu cache).
> > 
> > Once the task has returned to userspace after issuing the prctl(),
> > if it enters the kernel again via system call, page fault, or any
> > other exception or irq, the kernel will kill it with SIGKILL.

This severely limits usage of the interface. 

I suppose the reason for blocking system calls is to make sure
userspace does not initiate actions that might generate interruptions,
such as IPI flushes (memory unmaps or changes) or vmstat work items
(page dirtying). Or is there another reason for it?


+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+   /* No need to incur an isolation signal if we are just exiting. */
+   if (syscall == __NR_exit || syscall == __NR_exit_group)
+   return true;
+   
+   /* Check to see if it's the prctl for isolation. */
+   if (syscall == __NR_prctl) {
+   unsigned long arg[SYSCALL_MAX_ARGS];
+   
+   syscall_get_arguments(current, current_pt_regs(), arg);
+   if (arg[0] == PR_TASK_ISOLATION)
+   return true;
+   }
+ 
+   return false;
+}


> > In addition to sending a signal, the code supports a kernel
> > command-line "task_isolation_debug" flag which causes a stack
> > backtrace to be generated whenever a task loses isolation.
> > 
> > To allow the state to be entered and exited, the syscall checking
> > test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> > clear the bit again later, and ignores exit/exit_group to allow
> > exiting the task without a pointless signal being delivered.
> > 
> > The prctl() API allows for specifying a signal number to use instead
> > of the default SIGKILL, to allow for catching the notification
> > signal; for example, in a production environment, it might be
> > helpful to log information to the application logging mechanism
> > before exiting. Or, the signal handler might choose to reset the
> > program counter back to the code segment intended to be run isolated
> > via prctl() to continue execution.
> 
> Hi Alex,
> 
> I'm glad this patchset is being resurected.
> Reading that changelog, I like the general idea and the direction.
> The diff is a bit scary though but I'll check the patches in detail
> in the upcoming days.
> 
> > 
> > In a number of cases we can tell on a remote cpu that we are
> > going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> > In that case we generate the diagnostic (and optional stack dump)
> > on the remote core to be able to deliver better diagnostics.
> > If the interrupt is not something caught by Linux (e.g. a
> > hypervisor interrupt) we can also request a reschedule IPI to
> > be sent to the remote core so it can be sure to generate a
> > 
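
For illustration, a hypothetical userspace usage of the proposed (not
upstream) prctl() interface described in the changelog above. The
PR_TASK_ISOLATION values below are placeholders; the real definitions come
from the task_isolation series' uapi headers, and CPU 2 is just an example
isolated CPU.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <sys/prctl.h>

/* Placeholder values for illustration; not defined in mainline headers. */
#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Pin to a CPU configured via isolcpus=nohz,domain (CPU 2 as an example). */
	CPU_ZERO(&set);
	CPU_SET(2, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* Retry until the kernel reports that everything could be quiesced. */
	while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0))
		;

	/* Isolated busy loop: any syscall, fault or IRQ from here on ends isolation. */
	for (;;)
		;
}
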

Re: [PATCH] KVM: Don't shrink/grow vCPU halt_poll_ns if host side polling is disabled

2019-09-27 Thread Marcelo Tosatti
On Fri, Sep 27, 2019 at 04:27:02PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> Don't waste cycles to shrink/grow vCPU halt_poll_ns if host 
> side polling is disabled.
> 
> Cc: Marcelo Tosatti 
> Signed-off-by: Wanpeng Li 
> ---
>  virt/kvm/kvm_main.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e6de315..b368be4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2359,20 +2359,22 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>   kvm_arch_vcpu_unblocking(vcpu);
>   block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>  
> - if (!vcpu_valid_wakeup(vcpu))
> - shrink_halt_poll_ns(vcpu);
> - else if (halt_poll_ns) {
> - if (block_ns <= vcpu->halt_poll_ns)
> - ;
> - /* we had a long block, shrink polling */
> - else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> + if (!kvm_arch_no_poll(vcpu)) {
> + if (!vcpu_valid_wakeup(vcpu))
>   shrink_halt_poll_ns(vcpu);
> - /* we had a short halt and our poll time is too small */
> - else if (vcpu->halt_poll_ns < halt_poll_ns &&
> - block_ns < halt_poll_ns)
> - grow_halt_poll_ns(vcpu);
> - } else
> - vcpu->halt_poll_ns = 0;
> + else if (halt_poll_ns) {
> + if (block_ns <= vcpu->halt_poll_ns)
> + ;
> + /* we had a long block, shrink polling */
> + else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> + shrink_halt_poll_ns(vcpu);
> + /* we had a short halt and our poll time is too small */
> + else if (vcpu->halt_poll_ns < halt_poll_ns &&
> + block_ns < halt_poll_ns)
> + grow_halt_poll_ns(vcpu);
> + } else
> + vcpu->halt_poll_ns = 0;
> + }
>  
>   trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
>   kvm_arch_vcpu_block_finish(vcpu);
> -- 
> 2.7.4

Looks good.



Re: [PATCH v3] cpuidle-haltpoll: vcpu hotplug support

2019-09-02 Thread Marcelo Tosatti
On Mon, Sep 02, 2019 at 10:34:07PM +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 2, 2019 at 12:43 PM Joao Martins  
> wrote:
> >
> > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
> > past the online ones and thus fail to register the idle driver.
> > This is because cpuidle_add_sysfs() will return with -ENODEV as a
> > consequence of get_cpu_device() returning no device for a non-existing
> > CPU.
> >
> > Instead switch to cpuidle_register_driver() and manually register each
> > of the present cpus through cpuhp_setup_state() callbacks and future
> > ones that get onlined or offlined. This mimics similar logic that
> > intel_idle does.
> >
> > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> > Signed-off-by: Joao Martins 
> > Signed-off-by: Boris Ostrovsky 
> > ---
> > v3:
> > * register the teardown callback for correct handling of hotunplug
> > and error cases. In case cpuhp_setup_state calls fails (e.g. in one of
> > the cpus that it invoked the callback) it will then call the teardown of
> > the previously enabled devices; so no need to handle that manually in
> > haltpoll_uninit().
> > * use the cpuhp_setup_state() returned dyn allocated state when it
> > succeeds. And use that state in haltpoll_unint() to call
> > cpuhp_remove_state() instead of looping online cpus manually. This
> > is because cpuhp_remove_state() invokes the teardown/offline callback.
> > * fix subsystem name to 'cpuidle' instead of 'idle' in cpuhp_setup_state()
> 
> Marcelo, is the R-by still applicable?
> 
> Paolo, any comments?
> 
> >
> > v2:
> > * move cpus_read_unlock() after unregistering all cpuidle_devices;
> > (Marcelo Tosatti)
> > * redundant usage of cpuidle_unregister() when only
> > cpuidle_unregister_driver() suffices; (Marcelo Tosatti)
> > * cpuhp_setup_state() returns a state (> 0) for CPUHP_AP_ONLINE_DYN
> > ---
> >  arch/x86/include/asm/cpuidle_haltpoll.h |  4 +-
> >  arch/x86/kernel/kvm.c   | 18 +++
> >  drivers/cpuidle/cpuidle-haltpoll.c  | 68 +++--
> >  include/linux/cpuidle_haltpoll.h|  4 +-
> >  4 files changed, 73 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/cpuidle_haltpoll.h 
> > b/arch/x86/include/asm/cpuidle_haltpoll.h
> > index ff8607d81526..c8b39c6716ff 100644
> > --- a/arch/x86/include/asm/cpuidle_haltpoll.h
> > +++ b/arch/x86/include/asm/cpuidle_haltpoll.h
> > @@ -2,7 +2,7 @@
> >  #ifndef _ARCH_HALTPOLL_H
> >  #define _ARCH_HALTPOLL_H
> >
> > -void arch_haltpoll_enable(void);
> > -void arch_haltpoll_disable(void);
> > +void arch_haltpoll_enable(unsigned int cpu);
> > +void arch_haltpoll_disable(unsigned int cpu);
> >
> >  #endif
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 8d150e3732d9..a9b6c4e2446d 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -880,32 +880,26 @@ static void kvm_enable_host_haltpoll(void *i)
> > wrmsrl(MSR_KVM_POLL_CONTROL, 1);
> >  }
> >
> > -void arch_haltpoll_enable(void)
> > +void arch_haltpoll_enable(unsigned int cpu)
> >  {
> > if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
> > -   printk(KERN_ERR "kvm: host does not support poll 
> > control\n");
> > -   printk(KERN_ERR "kvm: host upgrade recommended\n");
> > +   pr_err_once("kvm: host does not support poll control\n");
> > +   pr_err_once("kvm: host upgrade recommended\n");
> > return;
> > }
> >
> > -   preempt_disable();
> > /* Enable guest halt poll disables host halt poll */
> > -   kvm_disable_host_haltpoll(NULL);
> > -   smp_call_function(kvm_disable_host_haltpoll, NULL, 1);
> > -   preempt_enable();
> > +   smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
> >  }
> >  EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
> >
> > -void arch_haltpoll_disable(void)
> > +void arch_haltpoll_disable(unsigned int cpu)
> >  {
> > if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
> > return;
> >
> > -   preempt_disable();
> > /* Enable guest halt poll disables host halt poll */
> > -   kvm_enable_host_haltpoll(NULL);
> > -   smp_call_function(kvm_enable_host_haltpoll, NULL, 1);
> > -   preempt_enable();
> > +   smp_call_functi
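
A condensed sketch of the registration scheme the changelog describes,
assuming haltpoll_cpu_online()/haltpoll_cpu_offline() callbacks that create
and tear down the per-CPU cpuidle devices; the "cpuidle/haltpoll:online"
state name is illustrative.

static enum cpuhp_state haltpoll_hp_state;

static int __init haltpoll_init_sketch(void)
{
	int ret;

	ret = cpuidle_register_driver(&haltpoll_driver);
	if (ret < 0)
		return ret;

	/* CPUHP_AP_ONLINE_DYN returns a dynamically allocated state (> 0). */
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "cpuidle/haltpoll:online",
				haltpoll_cpu_online, haltpoll_cpu_offline);
	if (ret < 0) {
		cpuidle_unregister_driver(&haltpoll_driver);
		return ret;
	}

	haltpoll_hp_state = ret;
	return 0;
}

static void __exit haltpoll_exit_sketch(void)
{
	/* Invokes the teardown/offline callback on each registered CPU. */
	cpuhp_remove_state(haltpoll_hp_state);
	cpuidle_unregister_driver(&haltpoll_driver);
}
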

Re: Is: Default governor regardless of cpuidle driver Was: [PATCH v2] cpuidle-haltpoll: vcpu hotplug support

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote:
> On 8/29/19 4:10 PM, Joao Martins wrote:
> > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
> > past the online ones and thus fail to register the idle driver.
> > This is because cpuidle_add_sysfs() will return with -ENODEV as a
> > consequence of get_cpu_device() returning no device for a non-existing
> > CPU.
> > 
> > Instead switch to cpuidle_register_driver() and manually register each
> > of the present cpus through cpuhp_setup_state() callback and future
> > ones that get onlined. This mimics similar logic that intel_idle does.
> > 
> > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> > Signed-off-by: Joao Martins 
> > Signed-off-by: Boris Ostrovsky 
> > ---
> 
> While testing the above, I found another issue in the haltpoll series.
> But I am not sure what is best suited to the cpuidle framework, hence requesting
> some advice on whether the below is a reasonable solution or something else is preferred.
> 
> Essentially, after the haltpoll governor got introduced, and regardless of the cpuidle
> driver, the default governor is going to be haltpoll for a guest (given the haltpoll
> governor doesn't get registered on bare metal).

Right.

> Right now, for a KVM guest, the
> idle governors have these ratings:
> 
>  * ladder              -> 10
>  * teo                 -> 19
>  * menu                -> 20
>  * haltpoll            -> 21
>  * ladder + nohz=off   -> 25

Yes. PowerPC KVM guests currently crash due to the use of the haltpoll
governor (I have a patch in my queue to fix this, but your solution
covers more cases).

> When a guest is booted with MWAIT and intel_idle is probed and successfully
> registered, we will end up with the haltpoll governor being used as opposed to
> 'menu' (which used to be the default). IIUC this would prevent C-states other
> than poll_state (state 0) and state 1 from being used.
> 
> Given that the haltpoll governor is largely only useful with cpuidle-haltpoll,
> it doesn't look reasonable for it to be the default. What about using the haltpoll
> governor as the default only when the haltpoll idle driver registers or modloads?
> 
> My idea to achieve the above would be to decrease the rating to 9 (below the
> lowest rated governor) and retain the old defaults from before haltpoll. Then we would
> allow a cpuidle driver to define a preferred governor to switch to on idle driver
> registration. Naturally all of this would be ignored if overridden by
> cpuidle.governor=.
> 
> The diff below the scissors line is an example of that.
> 
> Thoughts?

Works for me. Rafael?

> 
> -- >8 
> 
> From: Joao Martins 
> Subject: [PATCH] cpuidle: switch to preferred governor on registration
> 
> Signed-off-by: Joao Martins 
> ---
>  drivers/cpuidle/cpuidle-haltpoll.c   |  1 +
>  drivers/cpuidle/cpuidle.h|  1 +
>  drivers/cpuidle/driver.c | 26 ++
>  drivers/cpuidle/governor.c   |  6 +++---
>  drivers/cpuidle/governors/haltpoll.c |  2 +-
>  include/linux/cpuidle.h  |  3 +++
>  6 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-haltpoll.c 
> b/drivers/cpuidle/cpuidle-haltpoll.c
> index 8baade23f8d0..88a38c3c35e4 100644
> --- a/drivers/cpuidle/cpuidle-haltpoll.c
> +++ b/drivers/cpuidle/cpuidle-haltpoll.c
> @@ -33,6 +33,7 @@ static int default_enter_idle(struct cpuidle_device *dev,
> 
>  static struct cpuidle_driver haltpoll_driver = {
>   .name = "haltpoll",
> + .governor = "haltpoll",
>   .owner = THIS_MODULE,
>   .states = {
>   { /* entry 0 is for polling */ },
> diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
> index d6613101af92..c046f49c1920 100644
> --- a/drivers/cpuidle/cpuidle.h
> +++ b/drivers/cpuidle/cpuidle.h
> @@ -22,6 +22,7 @@ extern void cpuidle_install_idle_handler(void);
>  extern void cpuidle_uninstall_idle_handler(void);
> 
>  /* governors */
> +extern struct cpuidle_governor *cpuidle_find_governor(const char *str);
>  extern int cpuidle_switch_governor(struct cpuidle_governor *gov);
> 
>  /* sysfs */
> diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
> index dc32f34e68d9..8b8b9d89ce58 100644
> --- a/drivers/cpuidle/driver.c
> +++ b/drivers/cpuidle/driver.c
> @@ -87,6 +87,7 @@ static inline int __cpuidle_set_driver(struct 
> cpuidle_driver *drv)
>  #else
> 
>  static struct cpuidle_driver *cpuidle_curr_driver;
> +static struct cpuidle_governor *cpuidle_default_governor = NULL;
> 
>  /**
>   * __cpuidle_get_cpu_driver - return the global cpuidle driver pointer.
> @@ -254,12 +255,25 @@ static void __cpuidle_unregister_driver(struct
> cpuidle_driver *drv)
>   */
>  int cpuidle_register_driver(struct cpuidle_driver *drv)
>  {
> + struct cpuidle_governor *gov;
>   int ret;
> 
>   spin_lock(&cpuidle_driver_lock);
>   ret = __cpuidle_register_driver(drv);
>   spin_unlock(&cpuidle_driver_lock);
> 
> + if (!ret && 

Re: [PATCH v2] cpuidle-haltpoll: vcpu hotplug support

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 04:10:27PM +0100, Joao Martins wrote:
> When cpus != maxcpus, cpuidle-haltpoll will fail to register all vcpus
> past the online ones and thus fail to register the idle driver.
> This is because cpuidle_add_sysfs() will return -ENODEV as a
> consequence of get_cpu_device() returning no device for a non-existing
> CPU.
> 
> Instead, switch to cpuidle_register_driver() and manually register each
> of the present cpus through the cpuhp_setup_state() callback, and future
> ones as they get onlined. This mimics similar logic to what intel_idle does.
> 
> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> Signed-off-by: Joao Martins 
> Signed-off-by: Boris Ostrovsky 
> ---
> v2:
> * move cpus_read_unlock() right after unregistering all cpuidle_devices;
> (Marcelo Tosatti)
> * redundant usage of cpuidle_unregister() when only
> cpuidle_unregister_driver() suffices; (Marcelo Tosatti)
> * cpuhp_setup_state() returns a state (> 0) on success with 
> CPUHP_AP_ONLINE_DYN
> thus we set @ret to 0
> ---
>  arch/x86/include/asm/cpuidle_haltpoll.h |  4 +-
>  arch/x86/kernel/kvm.c   | 18 +++
>  drivers/cpuidle/cpuidle-haltpoll.c  | 67 +++--
>  include/linux/cpuidle_haltpoll.h|  4 +-
>  4 files changed, 72 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpuidle_haltpoll.h 
> b/arch/x86/include/asm/cpuidle_haltpoll.h
> index ff8607d81526..c8b39c6716ff 100644
> --- a/arch/x86/include/asm/cpuidle_haltpoll.h
> +++ b/arch/x86/include/asm/cpuidle_haltpoll.h
> @@ -2,7 +2,7 @@
>  #ifndef _ARCH_HALTPOLL_H
>  #define _ARCH_HALTPOLL_H
>  
> -void arch_haltpoll_enable(void);
> -void arch_haltpoll_disable(void);
> +void arch_haltpoll_enable(unsigned int cpu);
> +void arch_haltpoll_disable(unsigned int cpu);
>  
>  #endif
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8d150e3732d9..a9b6c4e2446d 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -880,32 +880,26 @@ static void kvm_enable_host_haltpoll(void *i)
>   wrmsrl(MSR_KVM_POLL_CONTROL, 1);
>  }
>  
> -void arch_haltpoll_enable(void)
> +void arch_haltpoll_enable(unsigned int cpu)
>  {
>   if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
> - printk(KERN_ERR "kvm: host does not support poll control\n");
> - printk(KERN_ERR "kvm: host upgrade recommended\n");
> + pr_err_once("kvm: host does not support poll control\n");
> + pr_err_once("kvm: host upgrade recommended\n");
>   return;
>   }
>  
> - preempt_disable();
>   /* Enable guest halt poll disables host halt poll */
> - kvm_disable_host_haltpoll(NULL);
> - smp_call_function(kvm_disable_host_haltpoll, NULL, 1);
> - preempt_enable();
> + smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
>  }
>  EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
>  
> -void arch_haltpoll_disable(void)
> +void arch_haltpoll_disable(unsigned int cpu)
>  {
>   if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
>   return;
>  
> - preempt_disable();
>   /* Enable guest halt poll disables host halt poll */
> - kvm_enable_host_haltpoll(NULL);
> - smp_call_function(kvm_enable_host_haltpoll, NULL, 1);
> - preempt_enable();
> + smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
>  }
>  EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
>  #endif
> diff --git a/drivers/cpuidle/cpuidle-haltpoll.c 
> b/drivers/cpuidle/cpuidle-haltpoll.c
> index 9ac093dcbb01..8baade23f8d0 100644
> --- a/drivers/cpuidle/cpuidle-haltpoll.c
> +++ b/drivers/cpuidle/cpuidle-haltpoll.c
> @@ -11,12 +11,15 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  
> +static struct cpuidle_device __percpu *haltpoll_cpuidle_devices;
> +
>  static int default_enter_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
>  {
> @@ -46,6 +49,48 @@ static struct cpuidle_driver haltpoll_driver = {
>   .state_count = 2,
>  };
>  
> +static int haltpoll_cpu_online(unsigned int cpu)
> +{
> + struct cpuidle_device *dev;
> +
> + dev = per_cpu_ptr(haltpoll_cpuidle_devices, cpu);
> + if (!dev->registered) {
> + dev->cpu = cpu;
> + if (cpuidle_register_device(dev)) {
> + pr_notice("cpuidle_register_device %d failed!\n", cpu);
> + return -EIO;
> + }
> +   

Re: [PATCH v1] cpuidle-haltpoll: vcpu hotplug support

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 03:24:31PM +0100, Joao Martins wrote:
> On 8/29/19 2:50 PM, Joao Martins wrote:
> > On 8/29/19 12:56 PM, Marcelo Tosatti wrote:
> >> Hi Joao,
> >>
> >> On Wed, Aug 28, 2019 at 07:56:50PM +0100, Joao Martins wrote:
> >>> +static void haltpoll_uninit(void)
> >>> +{
> >>> + unsigned int cpu;
> >>> +
> >>> + cpus_read_lock();
> >>> +
> >>> + for_each_online_cpu(cpu) {
> >>> + struct cpuidle_device *dev =
> >>> + per_cpu_ptr(haltpoll_cpuidle_devices, cpu);
> >>> +
> >>> + if (!dev->registered)
> >>> + continue;
> >>> +
> >>> + arch_haltpoll_disable(cpu);
> >>> + cpuidle_unregister_device(dev);
> >>> + }
> >>
> >> 1)
> >>
> >>> +
> >>> + cpuidle_unregister(&haltpoll_driver);
> >>
> >> cpuidle_unregister_driver.
> > 
> > Will fix -- this was an oversight.
> > 
> >>
> >>> + free_percpu(haltpoll_cpuidle_devices);
> >>> + haltpoll_cpuidle_devices = NULL;
> >>> +
> >>> + cpus_read_unlock();
> >>
> >> Any reason you can't cpus_read_unlock() at 1) ?
> >>
> > No, let me adjust that too.
> > 
> >> Looks good otherwise.
> >>
> 
> BTW, should I take this as an Acked-by, a Reviewed-by, or neither? :)
> 
>   Joao

I'll ACK -v2 once you send it.



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 09:53:04AM -0300, Marcelo Tosatti wrote:
> On Thu, Aug 29, 2019 at 08:16:41PM +0800, Wanpeng Li wrote:
> > > Current situation regarding haltpoll driver is:
> > >
> > > overcommit group: haltpoll driver is not loaded by default, they are
> > > happy.
> > >
> > > non overcommit group: boots without "realtime hints" flag, loads haltpoll 
> > > driver,
> > > happy.
> > >
> > > Situation with patch above:
> > >
> > > overcommit group: haltpoll driver is not loaded by default, they are
> > > happy.
> > >
> > > non overcommit group: boots without "realtime hints" flag, haltpoll driver
> > > cannot be loaded.
> > 
> > non overcommit group, if they don't care latency/performance, they
> > don't need to enable haltpoll, "realtime hints" etc. Otherwise, they
> > should better tune.
> 
> As mentioned before, "being overcommitted" is a property which is 
> transitional.
> 
> A static true/false scheme reflects this poorly.
> 
> Therefore the OS should detect it and act accordingly.

Hi Wanpeng Li,

One suggestion for a dynamic "is overcommitted" scheme:

If the amount of stolen time, in the past record_steal_time window,
is more than 20% of the time in that window, then mark the system
as overcommitted. Otherwise, clear it.

Make that 20% configurable as a kvm module parameter.

Use that info to enable/disable overcommit features.

That should work, right?
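
A minimal sketch of that heuristic, for illustration only: the knob name
overcommit_threshold and the helpers kvm_set_overcommitted() and
update_overcommit_state() are invented here and do not exist in KVM;
only the 20%-of-window rule comes from the suggestion above.

#include <linux/module.h>
#include <linux/kvm_host.h>

static unsigned int overcommit_threshold = 20;	/* percent, hypothetical knob */
module_param(overcommit_threshold, uint, 0644);

/* Placeholder for whatever would toggle the overcommit-only PV features. */
static void kvm_set_overcommitted(struct kvm *kvm, bool overcommitted)
{
}

/*
 * Would be called once per steal-time accounting window with the stolen
 * time accumulated in that window.
 */
static void update_overcommit_state(struct kvm *kvm, u64 steal_delta_ns,
				    u64 window_ns)
{
	if (!window_ns)
		return;

	/* Stolen time above the threshold share of the window => overcommitted. */
	kvm_set_overcommitted(kvm, steal_delta_ns * 100 >
				   window_ns * overcommit_threshold);
}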






Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 08:16:41PM +0800, Wanpeng Li wrote:
> > Current situation regarding haltpoll driver is:
> >
> > overcommit group: haltpoll driver is not loaded by default, they are
> > happy.
> >
> > non overcommit group: boots without "realtime hints" flag, loads haltpoll 
> > driver,
> > happy.
> >
> > Situation with patch above:
> >
> > overcommit group: haltpoll driver is not loaded by default, they are
> > happy.
> >
> > non overcommit group: boots without "realtime hints" flag, haltpoll driver
> > cannot be loaded.
> 
> non overcommit group, if they don't care latency/performance, they
> don't need to enable haltpoll, "realtime hints" etc. Otherwise, they
> should better tune.

As mentioned before, "being overcommitted" is a property which is transitional.

A static true/false scheme reflects this poorly.

Therefore the OS should detect it and act accordingly.



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-29 Thread Marcelo Tosatti
On Thu, Aug 29, 2019 at 01:37:35AM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 28, 2019 at 4:39 PM Marcelo Tosatti  wrote:
> >
> > On Wed, Aug 28, 2019 at 10:45:44AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Aug 28, 2019 at 10:34 AM Wanpeng Li  wrote:
> > > >
> > > > On Tue, 27 Aug 2019 at 08:43, Wanpeng Li  wrote:
> > > > >
> > > > > Cc Michael S. Tsirkin,
> > > > > On Tue, 27 Aug 2019 at 04:42, Marcelo Tosatti  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 08:55:29AM +0800, Wanpeng Li wrote:
> > > > > > > On Sun, 4 Aug 2019 at 04:21, Marcelo Tosatti 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 01, 2019 at 06:54:49PM +0200, Paolo Bonzini wrote:
> > > > > > > > > On 01/08/19 18:51, Rafael J. Wysocki wrote:
> > > > > > > > > > On 8/1/2019 9:06 AM, Wanpeng Li wrote:
> > > > > > > > > >> From: Wanpeng Li 
> > > > > > > > > >>
> > > > > > > > > >> The downside of guest side polling is that polling is 
> > > > > > > > > >> performed even
> > > > > > > > > >> with other runnable tasks in the host. However, even if 
> > > > > > > > > >> poll in kvm
> > > > > > > > > >> can aware whether or not other runnable tasks in the same 
> > > > > > > > > >> pCPU, it
> > > > > > > > > >> can still incur extra overhead in over-subscribe scenario. 
> > > > > > > > > >> Now we can
> > > > > > > > > >> just enable guest polling when dedicated pCPUs are 
> > > > > > > > > >> available.
> > > > > > > > > >>
> > > > > > > > > >> Cc: Rafael J. Wysocki 
> > > > > > > > > >> Cc: Paolo Bonzini 
> > > > > > > > > >> Cc: Radim Krčmář 
> > > > > > > > > >> Cc: Marcelo Tosatti 
> > > > > > > > > >> Signed-off-by: Wanpeng Li 
> > > > > > > > > >
> > > > > > > > > > Paolo, Marcelo, any comments?
> > > > > > > > >
> > > > > > > > > Yes, it's a good idea.
> > > > > > > > >
> > > > > > > > > Acked-by: Paolo Bonzini 
> > > >
> > > > Hi Marcelo,
> > > >
> > > > If you don't have more concern, I guess Rafael can apply this patch
> > > > now since the merge window is not too far.
> > >
> > > I will likely queue it up later today and it will go to linux-next
> > > early next week.
> > >
> > > Thanks!
> >
> > NACK patch.
> 
> I got an ACK from Paolo on it, though.  Convince Paolo to withdraw his
> ACK if you want it to not be applied.
> 
> > Just don't load the haltpoll driver.
> 
> And why would that be better?

Split the group of all kvm users into two: an overcommit group and a
non-overcommit group.

Current situation regarding haltpoll driver is:

overcommit group: haltpoll driver is not loaded by default, they are
happy.

non overcommit group: boots without "realtime hints" flag, loads haltpoll 
driver, 
happy.

Situation with patch above:

overcommit group: haltpoll driver is not loaded by default, they are
happy.

non overcommit group: boots without "realtime hints" flag, haltpoll driver
cannot be loaded.






Re: [PATCH v1] cpuidle-haltpoll: vcpu hotplug support

2019-08-29 Thread Marcelo Tosatti
Hi Joao,

On Wed, Aug 28, 2019 at 07:56:50PM +0100, Joao Martins wrote:
> When cpus != maxcpus, cpuidle-haltpoll will fail to register all vcpus
> past the online ones and thus fail to register the idle driver.
> This is because cpuidle_add_sysfs() will return -ENODEV as a
> consequence of get_cpu_device() returning no device for a non-existing
> CPU.
> 
> Instead, switch to cpuidle_register_driver() and manually register each
> of the present cpus through cpuhp_setup_state() and future ones that
> get onlined. This mimics similar logic to intel_idle.
> 
> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> Signed-off-by: Joao Martins 
> Signed-off-by: Boris Ostrovsky 
> ---
>  arch/x86/include/asm/cpuidle_haltpoll.h |  4 +-
>  arch/x86/kernel/kvm.c   | 18 +++
>  drivers/cpuidle/cpuidle-haltpoll.c  | 65 +++--
>  include/linux/cpuidle_haltpoll.h|  4 +-
>  4 files changed, 70 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpuidle_haltpoll.h 
> b/arch/x86/include/asm/cpuidle_haltpoll.h
> index ff8607d81526..c8b39c6716ff 100644
> --- a/arch/x86/include/asm/cpuidle_haltpoll.h
> +++ b/arch/x86/include/asm/cpuidle_haltpoll.h
> @@ -2,7 +2,7 @@
>  #ifndef _ARCH_HALTPOLL_H
>  #define _ARCH_HALTPOLL_H
>  
> -void arch_haltpoll_enable(void);
> -void arch_haltpoll_disable(void);
> +void arch_haltpoll_enable(unsigned int cpu);
> +void arch_haltpoll_disable(unsigned int cpu);
>  
>  #endif
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8d150e3732d9..a9b6c4e2446d 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -880,32 +880,26 @@ static void kvm_enable_host_haltpoll(void *i)
>   wrmsrl(MSR_KVM_POLL_CONTROL, 1);
>  }
>  
> -void arch_haltpoll_enable(void)
> +void arch_haltpoll_enable(unsigned int cpu)
>  {
>   if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
> - printk(KERN_ERR "kvm: host does not support poll control\n");
> - printk(KERN_ERR "kvm: host upgrade recommended\n");
> + pr_err_once("kvm: host does not support poll control\n");
> + pr_err_once("kvm: host upgrade recommended\n");
>   return;
>   }
>  
> - preempt_disable();
>   /* Enable guest halt poll disables host halt poll */
> - kvm_disable_host_haltpoll(NULL);
> - smp_call_function(kvm_disable_host_haltpoll, NULL, 1);
> - preempt_enable();
> + smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
>  }
>  EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
>  
> -void arch_haltpoll_disable(void)
> +void arch_haltpoll_disable(unsigned int cpu)
>  {
>   if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
>   return;
>  
> - preempt_disable();
>   /* Enable guest halt poll disables host halt poll */
> - kvm_enable_host_haltpoll(NULL);
> - smp_call_function(kvm_enable_host_haltpoll, NULL, 1);
> - preempt_enable();
> + smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
>  }
>  EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
>  #endif
> diff --git a/drivers/cpuidle/cpuidle-haltpoll.c 
> b/drivers/cpuidle/cpuidle-haltpoll.c
> index 9ac093dcbb01..0d1853a7185e 100644
> --- a/drivers/cpuidle/cpuidle-haltpoll.c
> +++ b/drivers/cpuidle/cpuidle-haltpoll.c
> @@ -11,12 +11,15 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  
> +static struct cpuidle_device __percpu *haltpoll_cpuidle_devices;
> +
>  static int default_enter_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
>  {
> @@ -46,6 +49,48 @@ static struct cpuidle_driver haltpoll_driver = {
>   .state_count = 2,
>  };
>  
> +static int haltpoll_cpu_online(unsigned int cpu)
> +{
> + struct cpuidle_device *dev;
> +
> + dev = per_cpu_ptr(haltpoll_cpuidle_devices, cpu);
> + if (!dev->registered) {
> + dev->cpu = cpu;
> + if (cpuidle_register_device(dev)) {
> + pr_notice("cpuidle_register_device %d failed!\n", cpu);
> + return -EIO;
> + }
> + arch_haltpoll_enable(cpu);
> + }
> +
> + return 0;
> +}
> +
> +static void haltpoll_uninit(void)
> +{
> + unsigned int cpu;
> +
> + cpus_read_lock();
> +
> + for_each_online_cpu(cpu) {
> + struct cpuidle_device *dev =
> + per_cpu_ptr(haltpoll_cpuidle_devices, cpu);
> +
> + if (!dev->registered)
> + continue;
> +
> + arch_haltpoll_disable(cpu);
> + cpuidle_unregister_device(dev);
> + }

1)

> +
> + cpuidle_unregister(&haltpoll_driver);

cpuidle_unregister_driver.

> + free_percpu(haltpoll_cpuidle_devices);
> + haltpoll_cpuidle_devices = NULL;
> +
> + cpus_read_unlock();

Any reason you can't cpus_read_unlock() at 1) ?

Looks good otherwise.

Thanks!



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-28 Thread Marcelo Tosatti
On Wed, Aug 28, 2019 at 11:48:58AM -0300, Marcelo Tosatti wrote:
> On Tue, Aug 27, 2019 at 08:43:13AM +0800, Wanpeng Li wrote:
> > > > kvm adaptive halt-polling will compete with
> > > > vhost-kthreads, however, poll in guest unaware other runnable tasks in
> > > > the host which will defeat vhost-kthreads.
> > >
> > > It depends on how much work vhost-kthreads needs to do, how successful
> > > halt-poll in the guest is, and what improvement halt-polling brings.
> > > The amount of polling will be reduced to zero if polling
> > > is not successful.
> > 
> > We observe vhost-kthreads compete with vCPUs adaptive halt-polling in
> > kvm, it hurt performance in over-subscribe product environment,
> > polling in guest can make it worse.
> > 
> > Regards,
> > Wanpeng Li
> 
> Wanpeng,
> 
> Polling should not be performed if there is other work to do. For
> example, halt-polling could check a host/guest shared memory 
> region indicating whether there are other runnable tasks in the host.
> 
> Disabling polling means you will not achieve the improvement 
> even in the transitional periods where the system is not
> overcommitted (which should be frequent given that idling 
> is common).
> 
> Again, about your patch: it brings no benefit to anyone. 
> 
> Guest halt polling should be already disabled by default
> (the driver has to be loaded for guest polling to take place).

The most efficient solution would be to mwait on a memory 
region that both host and guest would write to.

No cpu cycles burned, full efficiency.

However both host and guest would have to write to this region, which
brings security concerns.




Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-28 Thread Marcelo Tosatti
On Tue, Aug 27, 2019 at 08:43:13AM +0800, Wanpeng Li wrote:
> > > kvm adaptive halt-polling will compete with
> > > vhost-kthreads, however, poll in guest unaware other runnable tasks in
> > > the host which will defeat vhost-kthreads.
> >
> > It depends on how much work vhost-kthreads needs to do, how successful
> > halt-poll in the guest is, and what improvement halt-polling brings.
> > The amount of polling will be reduced to zero if polling
> > is not successful.
> 
> We observe vhost-kthreads competing with vCPU adaptive halt-polling in
> kvm; it hurts performance in an over-subscribed production environment, and
> polling in the guest can make it worse.
> 
> Regards,
> Wanpeng Li

Wanpeng,

Polling should not be performed if there is other work to do. For
example, halt-polling could check a host/guest shared memory 
region indicating whether there are other runnable tasks in the host.

Disabling polling means you will not achieve the improvement 
even in the transitional periods where the system is not
overcommitted (which should be frequent given that idling 
is common).

Again, about your patch: it brings no benefit to anyone. 

Guest halt polling should be already disabled by default
(the driver has to be loaded for guest polling to take place).
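
To make the shared-memory hint above concrete, here is a hedged sketch of
what guest-side halt polling could look like if such an interface existed.
struct kvm_pcpu_hint, its other_tasks_runnable field and
host_has_other_work() are hypothetical; no such host/guest shared page is
defined today, and the per-CPU area would have to be allocated and
registered with the host elsewhere.

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/timekeeping.h>

/* Hypothetical per-pCPU hint published by the host scheduler. */
struct kvm_pcpu_hint {
	u8 other_tasks_runnable;
};

static struct kvm_pcpu_hint __percpu *kvm_pcpu_hint;

static bool host_has_other_work(void)
{
	return READ_ONCE(this_cpu_ptr(kvm_pcpu_hint)->other_tasks_runnable);
}

/* Poll for up to poll_limit_ns, but back off as soon as polling would hurt. */
static void guest_halt_poll(u64 poll_limit_ns)
{
	u64 start = ktime_get_ns();

	while (ktime_get_ns() - start < poll_limit_ns) {
		if (need_resched())		/* guest-side work pending */
			break;
		if (host_has_other_work())	/* host-side work pending */
			break;
		cpu_relax();
	}
}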



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-28 Thread Marcelo Tosatti
On Wed, Aug 28, 2019 at 10:45:44AM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 28, 2019 at 10:34 AM Wanpeng Li  wrote:
> >
> > On Tue, 27 Aug 2019 at 08:43, Wanpeng Li  wrote:
> > >
> > > Cc Michael S. Tsirkin,
> > > On Tue, 27 Aug 2019 at 04:42, Marcelo Tosatti  wrote:
> > > >
> > > > On Tue, Aug 13, 2019 at 08:55:29AM +0800, Wanpeng Li wrote:
> > > > > On Sun, 4 Aug 2019 at 04:21, Marcelo Tosatti  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Aug 01, 2019 at 06:54:49PM +0200, Paolo Bonzini wrote:
> > > > > > > On 01/08/19 18:51, Rafael J. Wysocki wrote:
> > > > > > > > On 8/1/2019 9:06 AM, Wanpeng Li wrote:
> > > > > > > >> From: Wanpeng Li 
> > > > > > > >>
> > > > > > > >> The downside of guest side polling is that polling is 
> > > > > > > >> performed even
> > > > > > > >> with other runnable tasks in the host. However, even if poll 
> > > > > > > >> in kvm
> > > > > > > >> can aware whether or not other runnable tasks in the same 
> > > > > > > >> pCPU, it
> > > > > > > >> can still incur extra overhead in over-subscribe scenario. Now 
> > > > > > > >> we can
> > > > > > > >> just enable guest polling when dedicated pCPUs are available.
> > > > > > > >>
> > > > > > > >> Cc: Rafael J. Wysocki 
> > > > > > > >> Cc: Paolo Bonzini 
> > > > > > > >> Cc: Radim Krčmář 
> > > > > > > >> Cc: Marcelo Tosatti 
> > > > > > > >> Signed-off-by: Wanpeng Li 
> > > > > > > >
> > > > > > > > Paolo, Marcelo, any comments?
> > > > > > >
> > > > > > > Yes, it's a good idea.
> > > > > > >
> > > > > > > Acked-by: Paolo Bonzini 
> >
> > Hi Marcelo,
> >
> > If you don't have more concern, I guess Rafael can apply this patch
> > now since the merge window is not too far.
> 
> I will likely queue it up later today and it will go to linux-next
> early next week.
> 
> Thanks!

NACK patch.

Just don't load the haltpoll driver.



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-26 Thread Marcelo Tosatti
On Tue, Aug 13, 2019 at 08:55:29AM +0800, Wanpeng Li wrote:
> On Sun, 4 Aug 2019 at 04:21, Marcelo Tosatti  wrote:
> >
> > On Thu, Aug 01, 2019 at 06:54:49PM +0200, Paolo Bonzini wrote:
> > > On 01/08/19 18:51, Rafael J. Wysocki wrote:
> > > > On 8/1/2019 9:06 AM, Wanpeng Li wrote:
> > > >> From: Wanpeng Li 
> > > >>
> > > >> The downside of guest side polling is that polling is performed even
> > > >> with other runnable tasks in the host. However, even if poll in kvm
> > > >> can aware whether or not other runnable tasks in the same pCPU, it
> > > >> can still incur extra overhead in over-subscribe scenario. Now we can
> > > >> just enable guest polling when dedicated pCPUs are available.
> > > >>
> > > >> Cc: Rafael J. Wysocki 
> > > >> Cc: Paolo Bonzini 
> > > >> Cc: Radim Krčmář 
> > > >> Cc: Marcelo Tosatti 
> > > >> Signed-off-by: Wanpeng Li 
> > > >
> > > > Paolo, Marcelo, any comments?
> > >
> > > Yes, it's a good idea.
> > >
> > > Acked-by: Paolo Bonzini 
> > >
> > > Paolo
> >
> 
> Hi Marcelo,
> 
> Sorry for the late response.
> 
> > I think KVM_HINTS_REALTIME is being abused somewhat.
> > It has no clear meaning and used in different locations
> > for different purposes.
> 
> ==================  =======  =================================
> KVM_HINTS_REALTIME  0        guest checks this feature bit to
>                              determine that vCPUs are never
>                              preempted for an unlimited time

Does "unlimited time" mean infinite time, or does it mean
10s? 1s?

The previous definition was much better IMO: HINTS_DEDICATED.


>                              allowing optimizations
> ==================  =======  =================================
> 
> Now it disables pv queued spinlock, 

OK. 

> pv tlb shootdown, 

OK.

> pv sched yield

"The idea is from Xen, when sending a call-function IPI-many to vCPUs,
yield if any of the IPI target vCPUs was preempted. 17% performance
increasement of ebizzy benchmark can be observed in an over-subscribe
environment. (w/ kvm-pv-tlb disabled, testing TLB flush call-function
IPI-many since call-function is not easy to be trigged by userspace
workload)."

This can probably hurt if vcpus are rarely preempted. 

> which are not expected to be present in a "vCPUs are never preempted for an
> unlimited time" scenario.
> 
> >
> > For example, i think that using pv queued spinlocks and
> > haltpoll is a desired scenario, which the patch below disallows.
> 
> So even if a dedicated pCPU is available, pv queued spinlocks should
> still be chosen if something like vhost-kthreads is used instead of
> DPDK/vhost-user. 

Can't you enable the individual features you need for optimizing 
the overcommitted case? This is how things have been done historically:
If a new feature is available, you enable it to get the desired
performance. x2apic, invariant-tsc, cpuidle haltpoll...

So in your case: enable pv schedyield, enable pv tlb shootdown.
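
As an illustration of gating one such feature individually (not the actual
upstream logic): kvm_para_has_feature() and the KVM_FEATURE_* bits are the
existing guest-side interfaces, while want_pv_tlb_flush and
pv_tlb_flush_supported() are names invented for this sketch. A similar
per-feature knob could cover pv sched yield.

#include <linux/kvm_para.h>
#include <linux/module.h>

static bool want_pv_tlb_flush = true;		/* hypothetical opt-in knob */
module_param(want_pv_tlb_flush, bool, 0444);

static bool pv_tlb_flush_supported(void)
{
	/* Decide per feature, rather than keying everything off KVM_HINTS_REALTIME. */
	return want_pv_tlb_flush &&
	       kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
	       kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
}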

> kvm adaptive halt-polling will compete with
> vhost-kthreads; however, polling in the guest is unaware of other runnable
> tasks in the host, which will defeat vhost-kthreads.

It depends on how much work vhost-kthreads needs to do, how successful 
halt-poll in the guest is, and what improvement halt-polling brings.
The amount of polling will be reduced to zero if polling 
is not successful.



Re: [PATCH] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available

2019-08-03 Thread Marcelo Tosatti
On Thu, Aug 01, 2019 at 06:54:49PM +0200, Paolo Bonzini wrote:
> On 01/08/19 18:51, Rafael J. Wysocki wrote:
> > On 8/1/2019 9:06 AM, Wanpeng Li wrote:
> >> From: Wanpeng Li 
> >>
> >> The downside of guest side polling is that polling is performed even
> >> with other runnable tasks in the host. However, even if poll in kvm
> >> can aware whether or not other runnable tasks in the same pCPU, it
> >> can still incur extra overhead in over-subscribe scenario. Now we can
> >> just enable guest polling when dedicated pCPUs are available.
> >>
> >> Cc: Rafael J. Wysocki 
> >> Cc: Paolo Bonzini 
> >> Cc: Radim Krčmář 
> >> Cc: Marcelo Tosatti 
> >> Signed-off-by: Wanpeng Li 
> > 
> > Paolo, Marcelo, any comments?
> 
> Yes, it's a good idea.
> 
> Acked-by: Paolo Bonzini 
> 
> Paolo

I think KVM_HINTS_REALTIME is being abused somewhat.
It has no clear meaning and is used in different locations
for different purposes.

For example, I think that using pv queued spinlocks and
haltpoll is a desired scenario, which the patch below disallows.

Wanpeng Li, currently the driver does not autoload, so polling in
the guest has to be enabled manually. Isn't that sufficient?




Re: [PATCH] Documentation: kvm: document CPUID bit for MSR_KVM_POLL_CONTROL

2019-07-02 Thread Marcelo Tosatti
On Tue, Jul 02, 2019 at 06:57:53PM +0200, Paolo Bonzini wrote:
> Cc: Marcelo Tosatti 
> Signed-off-by: Paolo Bonzini 
> ---
>  Documentation/virtual/kvm/cpuid.txt | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/cpuid.txt 
> b/Documentation/virtual/kvm/cpuid.txt
> index 979a77ba5377..2bdac528e4a2 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -66,6 +66,10 @@ KVM_FEATURE_PV_SEND_IPI||11 || guest 
> checks this feature bit
> ||   || before using paravirtualized
> ||   || send IPIs.
>  
> --
> +KVM_FEATURE_PV_POLL_CONTROL||12 || host-side polling on HLT can
> +   ||   || be disabled by writing
> +   ||   || to msr 0x4b564d05.
> +--
>  KVM_FEATURE_PV_SCHED_YIELD ||13 || guest checks this feature bit
> ||   || before using paravirtualized
> ||   || sched yield.
> -- 
> 1.8.3.1

ACK


Re: [PATCH v5 0/4] KVM: LAPIC: Implement Exitless Timer

2019-07-02 Thread Marcelo Tosatti
On Tue, Jul 02, 2019 at 06:38:56PM +0200, Paolo Bonzini wrote:
> On 21/06/19 11:39, Wanpeng Li wrote:
> > Dedicated instances are currently disturbed by unnecessary jitter due 
> > to the emulated lapic timers fire on the same pCPUs which vCPUs resident.
> > There is no hardware virtual timer on Intel for guest like ARM. Both 
> > programming timer in guest and the emulated timer fires incur vmexits.
> > This patchset tries to avoid vmexit which is incurred by the emulated 
> > timer fires in dedicated instance scenario. 
> > 
> > When nohz_full is enabled in dedicated instances scenario, the unpinned 
> > timer will be moved to the nearest busy housekeepers after commit
> > 9642d18eee2cd (nohz: Affine unpinned timers to housekeepers) and commit 
> > 444969223c8 ("sched/nohz: Fix affine unpinned timers mess"). However, 
> > KVM always makes lapic timer pinned to the pCPU which vCPU residents, the 
> > reason is explained by commit 61abdbe0 (kvm: x86: make lapic hrtimer 
> > pinned). Actually, these emulated timers can be offload to the housekeeping 
> > cpus since APICv is really common in recent years. The guest timer 
> > interrupt 
> > is injected by posted-interrupt which is delivered by housekeeping cpu 
> > once the emulated timer fires. 
> > 
> > The host admin should fine tuned, e.g. dedicated instances scenario w/ 
> > nohz_full cover the pCPUs which vCPUs resident, several pCPUs surplus 
> > for busy housekeeping, disable mwait/hlt/pause vmexits to keep in non-root  
> > mode, ~3% redis performance benefit can be observed on Skylake server.
> 
> Marcelo,
> 
> does this patch work for you or can you still see the oops?

Hi Paolo,

No more oopses with kvm/queue. Can you include:

Index: kvm/arch/x86/kvm/lapic.c
===
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -124,8 +124,7 @@ static inline u32 kvm_x2apic_id(struct k
 
 bool posted_interrupt_inject_timer(struct kvm_vcpu *vcpu)
 {
-   return pi_inject_timer && kvm_vcpu_apicv_active(vcpu) &&
-   kvm_hlt_in_guest(vcpu->kvm);
+   return pi_inject_timer && kvm_vcpu_apicv_active(vcpu);
 }
 EXPORT_SYMBOL_GPL(posted_interrupt_inject_timer);
 
However, for some reason (the hrtimer subsystem's responsibility) with cyclictest -i 200
on the guest, the timer runs on the local CPU:

   CPU 1/KVM-9454  [003] d..2   881.674196: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674200: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d.h.   881.674387: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   881.674393: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674395: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674399: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d.h.   881.674586: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   881.674593: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674595: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674599: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d.h.   881.674787: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   881.674793: get_nohz_timer_target: get_nohz_timer_target 3->0
   CPU 1/KVM-9454  [003] d..2   881.674795: get_nohz_timer_target: get_nohz_timer_target 3->0

But on boot:

   CPU 1/KVM-9454  [003] d..2   578.625394: get_nohz_timer_target: get_nohz_timer_target 3->0
  <idle>-0         [000] d.h1   578.626390: apic_timer_fn <-__hrtimer_run_queues
  <idle>-0         [000] d.h1   578.626394: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   578.626401: get_nohz_timer_target: get_nohz_timer_target 3->0
  <idle>-0         [000] d.h1   578.628397: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   578.628407: get_nohz_timer_target: get_nohz_timer_target 3->0
  <idle>-0         [000] d.h1   578.631403: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   578.631413: get_nohz_timer_target: get_nohz_timer_target 3->0
  <idle>-0         [000] d.h1   578.635409: apic_timer_fn <-__hrtimer_run_queues
   CPU 1/KVM-9454  [003] d..2   578.635419: get_nohz_timer_target: get_nohz_timer_target 3->0
  <idle>-0         [000] d.h1   578.640415: apic_timer_fn <-__hrtimer_run_queues

Thanks.




Re: [PATCH v4 2/5] KVM: LAPIC: inject lapic timer interrupt by posted interrupt

2019-06-26 Thread Marcelo Tosatti
On Wed, Jun 26, 2019 at 07:02:13PM +0800, Wanpeng Li wrote:
> On Wed, 26 Jun 2019 at 03:03, Marcelo Tosatti  wrote:
> >
> > On Mon, Jun 24, 2019 at 04:53:53PM +0800, Wanpeng Li wrote:
> > > On Sat, 22 Jun 2019 at 06:11, Marcelo Tosatti  wrote:
> > > >
> > > > On Fri, Jun 21, 2019 at 09:42:39AM +0800, Wanpeng Li wrote:
> > > > > On Thu, 20 Jun 2019 at 05:04, Marcelo Tosatti  
> > > > > wrote:
> > > > > >
> > > > > > Hi Li,
> > > > > >
> > > > > > On Wed, Jun 19, 2019 at 08:36:06AM +0800, Wanpeng Li wrote:
> > > > > > > On Tue, 18 Jun 2019 at 21:36, Marcelo Tosatti 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, Jun 17, 2019 at 07:24:44PM +0800, Wanpeng Li wrote:
> > > > > > > > > From: Wanpeng Li 
> > > > > > > > >
> > > > > > > > > Dedicated instances are currently disturbed by unnecessary 
> > > > > > > > > jitter due
> > > > > > > > > to the emulated lapic timers fire on the same pCPUs which 
> > > > > > > > > vCPUs resident.
> > > > > > > > > There is no hardware virtual timer on Intel for guest like 
> > > > > > > > > ARM. Both
> > > > > > > > > programming timer in guest and the emulated timer fires incur 
> > > > > > > > > vmexits.
> > > > > > > > > This patch tries to avoid vmexit which is incurred by the 
> > > > > > > > > emulated
> > > > > > > > > timer fires in dedicated instance scenario.
> > > > > > > > >
> > > > > > > > > When nohz_full is enabled in dedicated instances scenario, 
> > > > > > > > > the emulated
> > > > > > > > > timers can be offload to the nearest busy housekeeping cpus 
> > > > > > > > > since APICv
> > > > > > > > > is really common in recent years. The guest timer interrupt 
> > > > > > > > > is injected
> > > > > > > > > by posted-interrupt which is delivered by housekeeping cpu 
> > > > > > > > > once the emulated
> > > > > > > > > timer fires.
> > > > > > > > >
> > > > > > > > > The host admin should fine tuned, e.g. dedicated instances 
> > > > > > > > > scenario w/
> > > > > > > > > nohz_full cover the pCPUs which vCPUs resident, several pCPUs 
> > > > > > > > > surplus
> > > > > > > > > for busy housekeeping, disable mwait/hlt/pause vmexits to 
> > > > > > > > > keep in non-root
> > > > > > > > > mode, ~3% redis performance benefit can be observed on 
> > > > > > > > > Skylake server.
> > > > > > > > >
> > > > > > > > > w/o patch:
> > > > > > > > >
> > > > > > > > > VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time   Avg time
> > > > > > > > >
> > > > > > > > > EXTERNAL_INTERRUPT  42916    49.43%    39.30%  0.47us    106.09us   0.71us ( +- 1.09% )
> > > > > > > > >
> > > > > > > > > w/ patch:
> > > > > > > > >
> > > > > > > > > VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time   Avg time
> > > > > > > > >
> > > > > > > > > EXTERNAL_INTERRUPT  6871     9.29%     2.96%   0.44us    57.88us    0.72us ( +- 4.02% )
> > > > > > > > >
> > > > > > > > > Cc: Paolo Bonzini 
> > > > > > > > > Cc: Radim Krčmář 
> > > > > > > > > Cc: Marcelo Tosatti 
> > > > > > > > > Signed-off-by: Wanpeng Li 
> > > > > > > > > ---
> > > > > > > > >  arch/x86/kvm/lapic.c| 33 
> > > > > > > > > 
