From: Long Li <[email protected]> Sent: Tuesday, October 7, 2025 5:56 PM
> 
> > -----Original Message-----
> > From: Michael Kelley <[email protected]>
> > Sent: Tuesday, October 7, 2025 8:42 AM
> > To: [email protected]; KY Srinivasan <[email protected]>; Haiyang
> > Zhang <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> > <[email protected]>; James E.J. Bottomley
> > <[email protected]>; Martin K. Petersen
> > <[email protected]>; James Bottomley <[email protected]>;
> > [email protected]; [email protected]; linux-
> > [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel 
> > with the
> > same CPU as on the I/O issuing CPU
> >
> > From: [email protected] <[email protected]> Sent:
> > Wednesday, October 1, 2025 10:06 PM
> > >
> > > When selecting an outgoing channel for I/O, storvsc tries to select a
> > > channel with a returning CPU that is not the same as issuing CPU. This
> > > worked well in the past, however it doesn't work well when the Hyper-V
> > > exposes a large number of channels (up to the number of all CPUs). Use
> > > a different CPU for returning channel is not efficient on Hyper-V.
> > >
> > > Change this behavior by preferring to the channel with the same CPU as
> > > the current I/O issuing CPU whenever possible.
> > >
> > > Tests have shown improvements in newer Hyper-V/Azure environment, and
> > > no regression with older Hyper-V/Azure environments.
> > >
> > > Tested-by: Raheel Abdul Faizy <[email protected]>
> > > Signed-off-by: Long Li <[email protected]>
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 96
> > > ++++++++++++++++++--------------------
> > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index d9e59204a9c3..092939791ea0 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct 
> > > storvsc_device *stor_device,
> > >   }
> > >
> > >   /*
> > > -  * Our channel array is sparsley populated and we
> > > +  * Our channel array could be sparsley populated and we
> > >    * initiated I/O on a processor/hw-q that does not
> > >    * currently have a designated channel. Fix this.
> > >    * The strategy is simple:
> > > -  * I. Ensure NUMA locality
> > > -  * II. Distribute evenly (best effort)
> > > +  * I. Prefer the channel associated with the current CPU
> > > +  * II. Ensure NUMA locality
> > > +  * III. Distribute evenly (best effort)
> > >    */
> > >
> > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > +         return stor_device->stor_chns[q_num];
> > > +
> >
> > Hmmm. When get_og_chn() is called, we know that stor_device-
> > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-
> > NULL case. And the checks are all done with stor_device->lock held, so the
> > stor_chns array can't change.
> > Hence the above code will return NULL, which will cause a NULL reference 
> > when
> > storvsc_do_io() sends out the VMBus packet.
> >
> > My recollection is that get_og_chan() is called when there is no channel 
> > that
> > interrupts the current CPU (that's what it means for stor_device-
> > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel
> > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > Adding code to prefer the channel associated with the current CPU doesn't 
> > make
> > sense in get_og_chn(), as get_og_chn() is only called when it is already 
> > known
> > that there is no such channel.
> 
> The initial values for stor_chns[] and alloced_cpus are set in 
> storvsc_channel_init() (for
> primary channel) and handle_sc_creation() (for subchannels).

OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding 
entry in
stor_chns[] will not be NULL.  And if the entry in stor_chns[] is NULL, the CPU 
bit in
alloced_cpus is *not* set. All the places that manipulate these fields update 
both so
they are in sync with each other.  Hence I'll agree the code you've added in 
get_og_chn()
will never return a NULL value.

(However, FWIW the reverse is not true:  If the entry in stor_chns[] is not 
NULL, the
corresponding CPU bit in alloced_cpus may or may not be set.)

> 
> As a result, the check for cpumask_test_cpu(q_num, 
> &stor_device->alloced_cpus) will
> guarantee we are getting a channel. If the check fails, the code follows the 
> old behavior
> to find a channel.
> 
> This check is needed because storvsc supports change_target_cpu_callback() 
> callback
> via vmbus.

But look at the code in storvsc_do_io() where get_og_chn() is called. I've 
copied the
code here for discussion purposes. This is the only place that get_og_chn() is 
called:

                spin_lock_irqsave(&stor_device->lock, flags);
                outgoing_channel = stor_device->stor_chns[q_num];
                if (outgoing_channel != NULL) {
                        spin_unlock_irqrestore(&stor_device->lock, flags);
                        goto found_channel;
                }
                outgoing_channel = get_og_chn(stor_device, q_num);
                spin_unlock_irqrestore(&stor_device->lock, flags);

The code gets the spin lock, then reads the stor_chns[] entry. If the entry is
non-NULL, then we've found a suitable channel and get_og_chn() is *not*
called. The only time get_og_chan() is called is when the stor_chn[] entry
*is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
So the check you've added in get_og_chn() can never be true and the check
is not needed. You said the check is needed because of 
change_target_cpu_callback(), which will invoke
storvsc_change_target_cpu(). But I don't see how that matters given the
checks done before get_og_chn() is called. The spin lock synchronizes with
any changes made by storvsc_change_target_cpu().

Related, there are a couple of occurrences of code like this:

                for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
                                  q_num + 1) {
                        channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
                        if (!channel)
                                continue;

Given your point that the alloced_cpus and stor_chns[] entries are
always in sync with each other, the check for channel being NULL
is not necessary.

Michael

Reply via email to