* Linus Torvalds <torva...@linux-foundation.org> wrote:

> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
> 
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
> 
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.

So this bit is the main straight-line performance impact when the 
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to 
enable broadly):

  +static inline bool sched_core_enabled(struct rq *rq)
  +{
  +       return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
  +}

   static inline raw_spinlock_t *rq_lockp(struct rq *rq)
   {
  +       if (sched_core_enabled(rq))
  +               return &rq->core->__lock;
  +
          return &rq->__lock;
   }


This should, at least in principle, keep the runtime overhead down to a 
few extra NOPs and a slightly bigger instruction cache footprint - modulo 
compiler shenanigans.
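
For reference, the reason the disabled case should stay close to free is 
the static key. Below is a minimal sketch of the pattern as I understand 
it - the key definition and the enable call site are my reconstruction, 
not copied verbatim from the patches:

   #include <linux/jump_label.h>

   /*
    * False-default key: static_branch_unlikely() on it compiles to a
    * NOP, so the '&& rq->core_enabled' part is skipped entirely until
    * the feature is switched on at runtime.
    */
   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

   static void sched_core_switch_on(void)      /* hypothetical call site */
   {
           /* Patches the NOP into a jump on all CPUs; undone again
            * via static_branch_disable(). */
           static_branch_enable(&__sched_core_enabled);
   }

With CONFIG_SCHED_CORE=y but the feature left off, sched_core_enabled() 
should thus reduce to a patched-out branch that falls through to 
'return false'.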

Here's the code generation impact on x86-64 defconfig:

   text    data     bss     dec     hex filename
    228      48       0     276     114 sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
    228      48       0     276     114 sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)

   4438      96       0    4534    11b6 sched.core.n/completion.o (ex sched.core.n/built-in.a)
   4438      96       0    4534    11b6 sched.core.y/completion.o (ex sched.core.y/built-in.a)

   2167    2428       0    4595    11f3 sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
   2167    2428       0    4595    11f3 sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)

  61099   22114     488   83701   146f5 sched.core.n/core.o (ex sched.core.n/built-in.a)
  70541   25370     508   96419   178a3 sched.core.y/core.o (ex sched.core.y/built-in.a)

   3262    6272       0    9534    253e sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
   3262    6272       0    9534    253e sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)

  12235     341      96   12672    3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073     917      96   14086    3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

  10293     477    1928   12698    319a sched.core.n/topology.o (ex sched.core.n/built-in.a)
  10363     509    1928   12800    3200 sched.core.y/topology.o (ex sched.core.y/built-in.a)

    886      24       0     910     38e sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
    886      24       0     910     38e sched.core.y/cpupri.o (ex sched.core.y/built-in.a)

   1061      64       0    1125     465 sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
   1077     128       0    1205     4b5 sched.core.y/stop_task.o (ex sched.core.y/built-in.a)

  18443     365      24   18832    4990 sched.core.n/deadline.o (ex sched.core.n/built-in.a)
  20019    2189      24   22232    56d8 sched.core.y/deadline.o (ex sched.core.y/built-in.a)

   1123       8      64    1195     4ab sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
   1123       8      64    1195     4ab sched.core.y/loadavg.o (ex sched.core.y/built-in.a)

   1323       8       0    1331     533 sched.core.n/stats.o (ex sched.core.n/built-in.a)
   1323       8       0    1331     533 sched.core.y/stats.o (ex sched.core.y/built-in.a)

   1282     164      32    1478     5c6 sched.core.n/isolation.o (ex sched.core.n/built-in.a)
   1282     164      32    1478     5c6 sched.core.y/isolation.o (ex sched.core.y/built-in.a)

   1564      36       0    1600     640 sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
   1564      36       0    1600     640 sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)

   1640      56       0    1696     6a0 sched.core.n/swait.o (ex sched.core.n/built-in.a)
   1640      56       0    1696     6a0 sched.core.y/swait.o (ex sched.core.y/built-in.a)

   1859     244      32    2135     857 sched.core.n/clock.o (ex sched.core.n/built-in.a)
   1859     244      32    2135     857 sched.core.y/clock.o (ex sched.core.y/built-in.a)

   2339       8       0    2347     92b sched.core.n/cputime.o (ex sched.core.n/built-in.a)
   2339       8       0    2347     92b sched.core.y/cputime.o (ex sched.core.y/built-in.a)

   3014      32       0    3046     be6 sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
   3014      32       0    3046     be6 sched.core.y/membarrier.o (ex sched.core.y/built-in.a)

  50027     964      96   51087    c78f sched.core.n/fair.o (ex sched.core.n/built-in.a)
  51537    2484      96   54117    d365 sched.core.y/fair.o (ex sched.core.y/built-in.a)

   3192     220       0    3412     d54 sched.core.n/idle.o (ex sched.core.n/built-in.a)
   3276     252       0    3528     dc8 sched.core.y/idle.o (ex sched.core.y/built-in.a)

   3633       0       0    3633     e31 sched.core.n/pelt.o (ex sched.core.n/built-in.a)
   3633       0       0    3633     e31 sched.core.y/pelt.o (ex sched.core.y/built-in.a)

   3794     160       0    3954     f72 sched.core.n/wait.o (ex sched.core.n/built-in.a)
   3794     160       0    3954     f72 sched.core.y/wait.o (ex sched.core.y/built-in.a)

I'd say this one is representative:

   text    data     bss     dec     hex filename
  12235     341      96   12672    3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073     917      96   14086    3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

That ~7% text bloat is primarily due to the higher rq-lock inlining 
overhead, I believe.

This is roughly what you'd expect from a change that wraps all 350+ 
inlined uses of rq->lock - i.e. it might make sense to uninline rq_lockp().
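
For illustration, uninlining could look something like the sketch below 
(my sketch, not taken from the patch set - where exactly the out-of-line 
copy would live is an assumption):

   /* kernel/sched/sched.h: only the declaration stays in the header */
   extern raw_spinlock_t *rq_lockp(struct rq *rq);

   /* kernel/sched/core.c: one out-of-line copy instead of 350+ inlined
    * instantiations spread across the scheduler */
   raw_spinlock_t *rq_lockp(struct rq *rq)
   {
           if (sched_core_enabled(rq))
                   return &rq->core->__lock;

           return &rq->__lock;
   }

That would trade a little per-call overhead for getting most of the 
CONFIG_SCHED_CORE=y text growth back.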

In terms of long-term maintenance overhead, ignoring the overhead of the 
core-scheduling feature itself, the rq-lock wrappery is the biggest 
ugliness; the rest is mostly isolated.
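
To make the wrappery concrete: every place that used to take &rq->lock 
directly now has to go through rq_lockp(), roughly along the lines below 
(illustrative helpers with made-up names, not the exact ones from the 
patches):

   /* Take whichever lock currently guards this runqueue: the per-rq
    * lock normally, the shared core-wide lock when core scheduling
    * is enabled. */
   static inline void rq_lock_example(struct rq *rq)
   {
           raw_spin_lock(rq_lockp(rq));
   }

   static inline void rq_unlock_example(struct rq *rq)
   {
           raw_spin_unlock(rq_lockp(rq));
   }

The ugliness is mostly that all the lock/unlock sites in kernel/sched/ 
have to be converted to that indirection.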

So if this actually *works*, improves the performance of some real 
VMEXIT-poor SMT workloads and allows HyperThreading to be enabled with 
untrusted VMs without inviting thousands of guest roots, then I'm 
cautiously in support of it.

> That all assumes that it works at all for the people who are clamoring 
> for this feature, but I guess they can run some loads on it eventually. 
> It's a holiday in the US right now ("Presidents' Day"), but maybe we 
> can get some numbers this week?

Such numbers would be *very* helpful indeed.

Thanks,

        Ingo
