On Wed, May 14, 2025 at 05:26:45PM +0000, Michael Kelley wrote:
> From: Yury Norov <[email protected]> Sent: Wednesday, May 14, 2025 9:55 AM
> >
> > On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > > > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > + bool skip_first_cpu)
> > > > {
> > > > const struct cpumask *next, *prev = cpu_none_mask;
> > > > cpumask_var_t cpus __free(free_cpumask_var);
> > > > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs,
> > > > unsigned int len, int node)
> > > > while (weight > 0) {
> > > > cpumask_andnot(cpus, next, prev);
> > > > for_each_cpu(cpu, cpus) {
> > > > + /*
> > > > + * if the CPU sibling set is to be
> > > > skipped we
> > > > + * just move on to the next CPUs
> > > > without len--
> > > > + */
> > > > + if (unlikely(skip_first_cpu)) {
> > > > + skip_first_cpu = false;
> > > > + goto next_cpumask;
> > > > + }
> > > > +
> > > > if (len-- == 0)
> > > > goto done;
> > > > +
> > > > irq_set_affinity_and_hint(*irqs++,
> > > > topology_sibling_cpumask(cpu));
> > > > +next_cpumask:
> > > > cpumask_andnot(cpus, cpus,
> > > > topology_sibling_cpumask(cpu));
> > > > --weight;
> > > > }
> > >
> > > With a little bit of reordering of the code, you could avoid the need for
> > > the "next_cpumask"
> > > label and goto statement. "continue" is usually cleaner than a "goto".
> > > Here's what I'm thinking:
> > >
> > > for_each_cpu(cpu, cpus) {
> > > cpumask_andnot(cpus, cpus,
> > > topology_sibling_cpumask(cpu));
> > > --weight;
> >
> > cpumask_andnot() is O(N), and before it was conditional on 'len == 0',
> > so we didn't do that on the very last step. Your version has to do that.
> > Don't know how important that is for real workloads. Shradha maybe can
> > measure it...
>
> Yes, there's one extra invocation of cpumask_andnot(). But if the
> VM has a small number of CPUs, that extra invocation is negligible.
> If the VM has a large number of CPUs, we're already executing
> cpumask_andnot() many times, so one extra time is also negligible.
> And this whole thing is done only at device initialization time, so
> it's not a hot path.
>
Hi Michael, Yury,
That's right, the overhead is negligible. Tested with some common
workloads. I will change this in the next version.
Shradha.
> >
> > >
> > > If (unlikely(skip_first_cpu)) {
> > > skip_first_cpu = false;
> > > continue;
> > > }
> > >
> > > If (len-- == 0)
> > > goto done;
> > >
> > > irq_set_affinity_and_hint(*irqs++,
> > > topology_sibling_cpumask(cpu));
> > > }
> > >
> > > I wish there were some comments in irq_setup() explaining the overall
> > > intention of
> > > the algorithm. I can see how the goal is to first assign CPUs that are
> > > local to the current
> > > NUMA node, and then expand outward to CPUs that are further away. And you
> > > want
> > > to *not* assign both siblings in a hyper-threaded core.
> >
> > I wrote this function, so let me step in.
> >
> > The intention is described in the corresponding commit message:
> >
> > Souradeep investigated that the driver performs faster if IRQs are
> > spread on CPUs with the following heuristics:
> >
> > 1. No more than one IRQ per CPU, if possible;
> > 2. NUMA locality is the second priority;
> > 3. Sibling dislocality is the last priority.
> >
> > Let's consider this topology:
> >
> > Node 0 1
> > Core 0 1 2 3
> > CPU 0 1 2 3 4 5 6 7
> >
> > The most performant IRQ distribution based on the above topology
> > and heuristics may look like this:
> >
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> > 2 1 0 0-1
> > 3 1 1 2-3
> > 4 2 2 4-5
> > 5 2 3 6-7
> > 6 2 2 4-5
> > 7 2 3 6-7
> >
> > > But I can't figure out what
> > > "weight" is trying to accomplish. Maybe this was discussed when the code
> > > first
> > > went in, but I can't remember now. :-(
> >
> > The weight here is to implement the heuristic discovered by Souradeep:
> > NUMA locality is preferred over sibling dislocality.
> >
> > The outer for_each() loop resets the weight to the actual number of
> > CPUs in the hop. Then inner for_each() loop decrements it by the
> > number of sibling groups (cores) while assigning first IRQ to each
> > group.
> >
> > Now, because NUMA locality is more important, we should walk the
> > same set of siblings and assign 2nd IRQ, and it's implemented by the
> > medium while() loop. So, we do like this unless the number of IRQs
> > assigned on this hop will not become equal to number of CPUs in the
> > hop (weight == 0). Then we switch to the next hop and do the same
> > thing.
> >
> > Hope that helps.
>
> Yes, that helps! So the key to understanding "weight" is that
> NUMA locality is preferred over sibling dislocality.
>
> This is a great summary! All or most of it should go as a
> comment describing the function and what it is trying to do.
>
> Michael