From: Long Li <[email protected]> Sent: Wednesday, October 8, 2025 10:20 AM
> 
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel 
> > with the
> > same CPU as on the I/O issuing CPU
> >
> > From: Long Li <[email protected]> Sent: Tuesday, October 7, 2025 5:56 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <[email protected]>
> > > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > > To: [email protected]; KY Srinivasan <[email protected]>;
> > > > Haiyang Zhang <[email protected]>; Wei Liu
> > > > <[email protected]>; Dexuan Cui <[email protected]>; James E.J.
> > > > Bottomley <[email protected]>; Martin K.
> > > > Petersen <[email protected]>; James Bottomley
> > > > <[email protected]>; [email protected];
> > > > [email protected]; linux- [email protected]
> > > > Cc: Long Li <[email protected]>
> > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > > channel with the same CPU as on the I/O issuing CPU
> > > >
> > > > From: [email protected] <[email protected]> Sent:
> > > > Wednesday, October 1, 2025 10:06 PM
> > > > >
> > > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > > select a channel with a returning CPU that is not the same as
> > > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > > well when the Hyper-V exposes a large number of channels (up to
> > > > > the number of all CPUs). Use a different CPU for returning channel is 
> > > > > not efficient on Hyper-V.
> > > > >
> > > > > Change this behavior by preferring to the channel with the same
> > > > > CPU as the current I/O issuing CPU whenever possible.
> > > > >
> > > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > > and no regression with older Hyper-V/Azure environments.
> > > > >
> > > > > Tested-by: Raheel Abdul Faizy <[email protected]>
> > > > > Signed-off-by: Long Li <[email protected]>
> > > > > ---
> > > > >  drivers/scsi/storvsc_drv.c | 96
> > > > > ++++++++++++++++++--------------------
> > > > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > > > >
> > > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > > 100644
> > > > > --- a/drivers/scsi/storvsc_drv.c
> > > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel 
> > > > > *get_og_chn(struct storvsc_device *stor_device,
> > > > >       }
> > > > >
> > > > >       /*
> > > > > -      * Our channel array is sparsley populated and we
> > > > > +      * Our channel array could be sparsley populated and we
> > > > >        * initiated I/O on a processor/hw-q that does not
> > > > >        * currently have a designated channel. Fix this.
> > > > >        * The strategy is simple:
> > > > > -      * I. Ensure NUMA locality
> > > > > -      * II. Distribute evenly (best effort)
> > > > > +      * I. Prefer the channel associated with the current CPU
> > > > > +      * II. Ensure NUMA locality
> > > > > +      * III. Distribute evenly (best effort)
> > > > >        */
> > > > >
> > > > > +     /* Prefer the channel on the I/O issuing processor/hw-q */
> > > > > +     if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > > +             return stor_device->stor_chns[q_num];
> > > > > +
> > > >
> > > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > > >the non-
> > > > NULL case. And the checks are all done with stor_device->lock held,
> > > > so the stor_chns array can't change.
> > > > Hence the above code will return NULL, which will cause a NULL
> > > > reference when
> > > > storvsc_do_io() sends out the VMBus packet.
> > > >
> > > > My recollection is that get_og_chan() is called when there is no
> > > > channel that interrupts the current CPU (that's what it means for
> > > > stor_device-
> > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > > >channel
> > > > that interrupts some other CPU, preferably a CPU in the current NUMA 
> > > > node.
> > > > Adding code to prefer the channel associated with the current CPU
> > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > > when it is already known that there is no such channel.
> > >
> > > The initial values for stor_chns[] and alloced_cpus are set in
> > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> > subchannels).
> >
> > OK, I agree that if the CPU bit in alloced_cpus is set, then the 
> > corresponding entry
> > in stor_chns[] will not be NULL.  And if the entry in stor_chns[] is NULL, 
> > the CPU
> > bit in alloced_cpus is *not* set. All the places that manipulate these 
> > fields update
> > both so they are in sync with each other.  Hence I'll agree the code you've 
> > added
> > in get_og_chn() will never return a NULL value.
> >
> > (However, FWIW the reverse is not true:  If the entry in stor_chns[] is not 
> > NULL,
> > the corresponding CPU bit in alloced_cpus may or may not be set.)
> >
> > >
> > > As a result, the check for cpumask_test_cpu(q_num,
> > > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > > If the check fails, the code follows the old behavior to find a channel.
> > >
> > > This check is needed because storvsc supports
> > > change_target_cpu_callback() callback via vmbus.
> >
> > But look at the code in storvsc_do_io() where get_og_chn() is called. I've 
> > copied
> > the code here for discussion purposes. This is the only place that 
> > get_og_chn() is
> > called:
> >
> >                 spin_lock_irqsave(&stor_device->lock, flags);
> >                 outgoing_channel = stor_device->stor_chns[q_num];
> >                 if (outgoing_channel != NULL) {
> >                         spin_unlock_irqrestore(&stor_device->lock, flags);
> >                         goto found_channel;
> >                 }
> >                 outgoing_channel = get_og_chn(stor_device, q_num);
> >                 spin_unlock_irqrestore(&stor_device->lock, flags);
> >
> > The code gets the spin lock, then reads the stor_chns[] entry. If the entry 
> > is non-
> > NULL, then we've found a suitable channel and get_og_chn() is *not* called. 
> > The
> > only time get_og_chan() is called is when the stor_chn[] entry
> > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> > So the check you've added in get_og_chn() can never be true and the check 
> > is not
> > needed. You said the check is needed because of 
> > change_target_cpu_callback(),
> > which will invoke storvsc_change_target_cpu(). But I don't see how that 
> > matters
> > given the checks done before get_og_chn() is called. The spin lock 
> > synchronizes
> > with any changes made by storvsc_change_target_cpu().
> 
> I agree this check may not be necessary. Given the current patch is correct, 
> how about I
> submit another patch to remove this check after more testing are done?

That's OK with me.

> 
> >
> > Related, there are a couple of occurrences of code like this:
> >
> >                 for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> >                                   q_num + 1) {
> >                         channel = 
> > READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> >                         if (!channel)
> >                                 continue;
> >
> > Given your point that the alloced_cpus and stor_chns[] entries are always 
> > in sync
> > with each other, the check for channel being NULL is not necessary.
> 
> I don't' agree with removing the check for NULL. This code follows the 
> existing storvsc behavior
> on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it 
> changes stor_chns
> before changing alloced_cpus. It can run at the same time as this code. I 
> think we may need
> to add some memory barriers in storvsc_change_target_cpu(), but that is for 
> another topic.

Yes, you are right. The spin lock isn't held in these cases, so the checks for 
NULL
should stay.

Michael

Reply via email to