Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-04-15 Thread Nitesh Narayan Lal


On 4/14/21 12:11 PM, Jesse Brandeburg wrote:
> Nitesh Narayan Lal wrote:
>
>>> The original issue as seen, was that if you rmmod/insmod a driver
>>> *without* irqbalance running, the default irq mask is -1, which means
>>> any CPU. The older kernels (this issue was patched in 2014) used to use
>>> that affinity mask, but the value programmed into all the interrupt
>>> registers "actual affinity" would end up delivering all interrupts to
>>> CPU0,
>> So does that mean the affinity mask for the IRQs was different wrt where
>> the IRQs were actually delivered?
>> Or, the affinity mask itself for the IRQs after rmmod, insmod was changed
>> to 0 instead of -1?
> The smp_affinity was 0xfff, and the kernel chooses which CPU to place
> the interrupt on, among any of the bits set.


I think what you are referring to here is the effective affinity mask.
From one of Thomas's commit messages that you pointed me to:

"The affinity mask is either the system-wide default or set by userspace,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs."

I was looking into the code changes around the IRQ subsystem, and there was a
major rework from Thomas in 2017. I recently tried testing a kernel from just
before those patches got merged.

Specifically on top of
05161b9cbe5:     x86/irq: Get rid of the 'first_system_vector' indirection
                 bogosity

On the box where I tested this, I saw the effective affinity (not the SMP
affinity) being set to 0 for several network device IRQs.

I think the reason for this is the use of "cpumask_first_and(mask,
cpu_online_mask)" in __assign_irq_vector().

But in the latest kernel this code has been replaced, which is why I didn't
see the effective affinity being set to only CPU 0 for all of the device IRQs.

Having said that, I am still not sure whether I was able to mimic what you
saw in the past, but it looked similar to what you were describing.
What do you think?



>
>  
>> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
>> mask before removing the kernel module and after doing rmmod+insmod
>> and didn't find any difference.
> with the patch in question removed? Sorry, I'm confused what you tried.

Yeah, but I was only referring to the SMP affinity mask. Please see more
up-to-date testing results above.

>>>  and if the machine was under traffic load incoming when the
>>> driver loaded, CPU0 would start to poll among all the different netdev
>>> queues, all on CPU0.
>>>
>>> The above then leads to the condition that the device is stuck polling
>>> even if the affinity gets updated from user space, and the polling will
>>> continue until traffic stops.
>>>

[...]

>> As we can see in the above trace, the initial affinity for IRQ 1478 was
>> correctly set as per the default_smp_affinity mask, which includes CPU 42;
>> however, it is later updated to CPU 3, which is returned from
>> cpumask_local_spread().
>>
>>> Maybe the right thing is to fix which CPUs are passed in as the valid
>>> mask, or make sure the kernel cross checks that what the driver asks
>>> for is a "valid CPU"?
>>>
>> Sure, if we can still reproduce the problem that your patch was fixing,
>> then maybe we can consider adding a new API like cpumask_local_spread_irq
>> that also takes the default_smp_affinity mask into account before returning
>> the CPU.
> I'm sure I don't have a reproducer of the original problem any more, it
> is lost somewhere 8 years ago. I'd like to be able to repro the original
> issue, but I can't.
>
> Your description of the problem makes it obvious there is an issue. It
> appears as if cpumask_local_spread() is the wrong function to use here.
> If you have any suggestions please let me know.
>
> We had one other report of this problem as well (I'm not sure if it's
> the same as your report)
> https://lkml.org/lkml/2021/3/28/206
> https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210125/023120.html


How about we introduce a new API just for IRQ spreading,
cpumask_local_spread_irq(), which takes the default_smp_affinity mask into
account before returning the CPU?

Although, I think the right way to deal with this would be to fix it at the
source, that is, where the CPU mask is assigned to an IRQ for the very first
time.


-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-04-08 Thread Nitesh Narayan Lal


On 4/7/21 11:18 AM, Nitesh Narayan Lal wrote:
> On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
>> Continuing a thread from a bit ago...
>>
>> Nitesh Narayan Lal wrote:
>>
>>>> After a little more digging, I found out why cpumask_local_spread change
>>>> affects the general/initial smp_affinity for certain device IRQs.
>>>>
>>>> After the introduction of the commit:
>>>>
>>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>>
>>> Continuing the conversation about the above commit and adding Jesse.
>>> I was trying to understand the problem that the commit message explains
>>> "The default behavior of the kernel is somewhat undesirable as all
>>> requested interrupts end up on CPU0 after registration.", I have also been
>>> trying to reproduce this behavior without the patch but I failed in doing
>>> so, maybe because I am missing something here.
>>>
>>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>>> the default affinity mask.
> Thanks, Jesse for responding.
>
>> The original issue as seen, was that if you rmmod/insmod a driver
>> *without* irqbalance running, the default irq mask is -1, which means
>> any CPU. The older kernels (this issue was patched in 2014) used to use
>> that affinity mask, but the value programmed into all the interrupt
>> registers "actual affinity" would end up delivering all interrupts to
>> CPU0,
> So does that mean the affinity mask for the IRQs was different wrt where
> the IRQs were actually delivered?
> Or, the affinity mask itself for the IRQs after rmmod, insmod was changed
> to 0 instead of -1?
>
> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
> mask before removing the kernel module and after doing rmmod+insmod
> and didn't find any difference.
>
>>  and if the machine was under traffic load incoming when the
>> driver loaded, CPU0 would start to poll among all the different netdev
>> queues, all on CPU0.
>>
>> The above then leads to the condition that the device is stuck polling
>> even if the affinity gets updated from user space, and the polling will
>> continue until traffic stops.
>>
>>> The problem with the commit is that when we overwrite the affinity mask
>>> based on the hinting mask we completely ignore the default SMP affinity
>>> mask. If we do want to overwrite the affinity based on the hint mask we
>>> should at least consider the default SMP affinity.
> For the issue where the IRQs don't follow the default_smp_affinity mask
> because of this patch, the following are the steps by which it can be easily
> reproduced with the latest linux kernel:
>
> # Kernel
> 5.12.0-rc6+
>
> # Other parameters in the cmdline
> isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
> rcu_nocbs=2-39,44-79
>
> # cat /proc/irq/default_smp_affinity
> ,0f00,0003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]
>
> # Create VFs and check IRQ affinity mask
>
> /proc/irq/1423/iavf-ens1f1v3-TxRx-3
> 3
> /proc/irq/1424/iavf-:3b:0b.0:mbx
> 0
> 40
> 42
> /proc/irq/1425/iavf-ens1f1v8-TxRx-0
> 0
> /proc/irq/1426/iavf-ens1f1v8-TxRx-1
> 1
> /proc/irq/1427/iavf-ens1f1v8-TxRx-2
> 2
> /proc/irq/1428/iavf-ens1f1v8-TxRx-3
> 3
> ...
> /proc/irq/1475/iavf-ens1f1v15-TxRx-0
> 0
> /proc/irq/1476/iavf-ens1f1v15-TxRx-1
> 1
> /proc/irq/1477/iavf-ens1f1v15-TxRx-2
> 2
> /proc/irq/1478/iavf-ens1f1v15-TxRx-3
> 3
> /proc/irq/1479/iavf-:3b:0a.0:mbx
> 0
> 40
> 42
> ...
> /proc/irq/240/iavf-ens1f1v3-TxRx-0
> 0
> /proc/irq/248/iavf-ens1f1v3-TxRx-1
> 1
> /proc/irq/249/iavf-ens1f1v3-TxRx-2
> 2
>
>
> Trace dump:
> --
> ..
> 11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
>             irq=1478 is_managed=0 can_reserve=1 reserve=0
> 11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
>             irq=1478 vector=65 reserved=1 ret=0
> 11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
>             irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
> 11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
>             irq=1478 vector=65 cpu=42 apicdest=0x0200
> 11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
>             irq=1478 vector=46 reserved=0 ret=0
>
> 11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
>             irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42
>
> 11551364:  NetworkManager-

Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-04-07 Thread Nitesh Narayan Lal


On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
> Continuing a thread from a bit ago...
>
> Nitesh Narayan Lal wrote:
>
>>> After a little more digging, I found out why cpumask_local_spread change
>>> affects the general/initial smp_affinity for certain device IRQs.
>>>
>>> After the introduction of the commit:
>>>
>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>
>> Continuing the conversation about the above commit and adding Jesse.
>> I was trying to understand the problem that the commit message explains
>> "The default behavior of the kernel is somewhat undesirable as all
>> requested interrupts end up on CPU0 after registration.", I have also been
>> trying to reproduce this behavior without the patch but I failed in doing
>> so, maybe because I am missing something here.
>>
>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>> the default affinity mask.

Thanks, Jesse, for responding.

> The original issue as seen, was that if you rmmod/insmod a driver
> *without* irqbalance running, the default irq mask is -1, which means
> any CPU. The older kernels (this issue was patched in 2014) used to use
> that affinity mask, but the value programmed into all the interrupt
> registers "actual affinity" would end up delivering all interrupts to
> CPU0,

So does that mean the affinity mask for the IRQs was different wrt where
the IRQs were actually delivered?
Or, the affinity mask itself for the IRQs after rmmod, insmod was changed
to 0 instead of -1?

I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
mask before removing the kernel module and after doing rmmod+insmod
and didn't find any difference.

>  and if the machine was under traffic load incoming when the
> driver loaded, CPU0 would start to poll among all the different netdev
> queues, all on CPU0.
>
> The above then leads to the condition that the device is stuck polling
> even if the affinity gets updated from user space, and the polling will
> continue until traffic stops.
>
>> The problem with the commit is that when we overwrite the affinity mask
>> based on the hinting mask we completely ignore the default SMP affinity
>> mask. If we do want to overwrite the affinity based on the hint mask we
>> should at least consider the default SMP affinity.

For the issue where the IRQs don't follow the default_smp_affinity mask
because of this patch, here are the steps to reproduce it easily with the
latest Linux kernel:

# Kernel
5.12.0-rc6+

# Other parameters in the cmdline
isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
rcu_nocbs=2-39,44-79

# cat /proc/irq/default_smp_affinity
,0f00,0003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]

# Create VFs and check IRQ affinity mask

/proc/irq/1423/iavf-ens1f1v3-TxRx-3
3
/proc/irq/1424/iavf-:3b:0b.0:mbx
0
40
42
/proc/irq/1425/iavf-ens1f1v8-TxRx-0
0
/proc/irq/1426/iavf-ens1f1v8-TxRx-1
1
/proc/irq/1427/iavf-ens1f1v8-TxRx-2
2
/proc/irq/1428/iavf-ens1f1v8-TxRx-3
3
...
/proc/irq/1475/iavf-ens1f1v15-TxRx-0
0
/proc/irq/1476/iavf-ens1f1v15-TxRx-1
1
/proc/irq/1477/iavf-ens1f1v15-TxRx-2
2
/proc/irq/1478/iavf-ens1f1v15-TxRx-3
3
/proc/irq/1479/iavf-:3b:0a.0:mbx
0
40
42
...
/proc/irq/240/iavf-ens1f1v3-TxRx-0
0
/proc/irq/248/iavf-ens1f1v3-TxRx-1
1
/proc/irq/249/iavf-ens1f1v3-TxRx-2
2


Trace dump:
--
..
11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
            irq=1478 is_managed=0 can_reserve=1 reserve=0
11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
            irq=1478 vector=65 reserved=1 ret=0
11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
            irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
            irq=1478 vector=65 cpu=42 apicdest=0x0200
11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
            irq=1478 vector=46 reserved=0 ret=0

11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
            irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42

11551364:  NetworkManager-1734  [040]  8167.465770: vector_config:      
            irq=1478 vector=46 cpu=3 apicdest=0x00040100
..

As we can see in the above trace, the initial affinity for IRQ 1478 was
correctly set as per the default_smp_affinity mask, which includes CPU 42;
however, it is later updated to CPU 3, which is returned from
cpumask_local_spread().

> Maybe the right thing is to fix which CPUs are passed in as the valid
> mask, or make sure the kernel cross checks that what the driver asks
> for is a "valid CPU"?
>

Sure, if we can still reproduce the problem that your patch was fixing, then
maybe we can consider adding a new API like cpumask_local_spread_irq that
also takes the default_smp_affinity mask into account before returning the
CPU.

-- 
Thanks
Nitesh



Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-03-04 Thread Nitesh Narayan Lal


On 3/4/21 4:13 PM, Alex Belits wrote:
> On 3/4/21 10:15, Nitesh Narayan Lal wrote:
>> External Email
>>
>> --
>>
>> On 2/11/21 10:55 AM, Nitesh Narayan Lal wrote:
>>> On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
>>>> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>>>>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>>>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>>>>
>>>>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>>>>> Does sound like a nice idea to explore; let's see what Thomas thinks
>>>>>> about it.
>>> 
>>>
>>>>>> When the affinity mask of the interrupt at the time when it is actually
>>>>>> requested contains an isolated CPU then nothing prevents the kernel from
>>>>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>>>>> with that spreading thingy.
>>>>>>
>>>>>> The only difference which this change makes is the fact that the
>>>>>> affinity hint changes. Nothing else.
>>>>>>
>>>> Thanks for the detailed explanation.
>>>>
>>>> Before I posted this patch, I was doing some debugging on a setup where I
>>>> was observing some latency issues due to the iavf IRQs that were pinned on
>>>> the isolated CPUs.
>>>>
>>>> Based on some initial traces I had this impression that the affinity hint
>>>> or cpumask_local_spread was somehow playing a role in deciding the affinity
>>>> mask of these IRQs. Although, that does look incorrect after going through
>>>> your explanation.
>>>> For some reason, with a kernel that had this patch when I tried creating
>>>> VFs iavf IRQs always ended up on the HK CPUs.
>>>>
>>>> The reasoning for the above is still not very clear to me. I will
>>>> investigate this further to properly understand this behavior.
>>>>
>>>>
>>> After a little more digging, I found out why cpumask_local_spread change
>>> affects the general/initial smp_affinity for certain device IRQs.
>>>
>>> After the introduction of the commit:
>>>
>>>  e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>
>>
>> Continuing the conversation about the above commit and adding Jesse.
>> I was trying to understand the problem that the commit message explains
>> "The default behavior of the kernel is somewhat undesirable as all
>> requested interrupts end up on CPU0 after registration.", I have also been
>> trying to reproduce this behavior without the patch but I failed in doing
>> so, maybe because I am missing something here.
>>
>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>> the default affinity mask.
>>
>> The problem with the commit is that when we overwrite the affinity mask
>> based on the hinting mask we completely ignore the default SMP affinity
>> mask. If we do want to overwrite the affinity based on the hint mask we
>> should at least consider the default SMP affinity.
>>
>
> cpumask_local_spread() is used by a small number of drivers, mostly for
> Ethernet and cryptographic devices, however it includes Cavium and Marvell
> devices that were included in every piece of hardware that I and Yury Norov
> worked on. Without my patch (or previous, later replaced, Yury's patch that
> was developed before there were housekeeping CPUs), driver would completely
> break any attempts to configure task isolation, because it would distribute
> processing over CPUs regardless of any masks related to isolation (and later
> housekeeping). This is why it was created, and it just happens that it also
> makes sense for CPU isolation in general. Of course, none of it would be
> experienced on hardware that does not include those devices, possibly
> creating some wrong impression about its effect and purpose.
>
> It may be that my patch can be criticized for not accommodating CPU hotplug
> and other runtime changes of masks. Or drivers can be criticized for their
> behavior that relies on calling

Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-03-04 Thread Nitesh Narayan Lal


On 2/11/21 10:55 AM, Nitesh Narayan Lal wrote:
> On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
>> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>>
>>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>>> Does sound like a nice idea to explore; let's see what Thomas thinks
>>>> about it.
> 
>
>>>> When the affinity mask of the interrupt at the time when it is actually
>>>> requested contains an isolated CPU then nothing prevents the kernel from
>>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>>> with that spreading thingy.
>>>>
>>>> The only difference which this change makes is the fact that the
>>>> affinity hint changes. Nothing else.
>>>>
>> Thanks for the detailed explanation.
>>
>> Before I posted this patch, I was doing some debugging on a setup where I
>> was observing some latency issues due to the iavf IRQs that were pinned on
>> the isolated CPUs.
>>
>> Based on some initial traces I had this impression that the affinity hint
>> or cpumask_local_spread was somehow playing a role in deciding the affinity
>> mask of these IRQs. Although, that does look incorrect after going through
>> your explanation.
>> For some reason, with a kernel that had this patch when I tried creating
>> VFs iavf IRQs always ended up on the HK CPUs.
>>
>> The reasoning for the above is still not very clear to me. I will investigate
>> this further to properly understand this behavior.
>>
>>
> After a little more digging, I found out why cpumask_local_spread change
> affects the general/initial smp_affinity for certain device IRQs.
>
> After the introduction of the commit:
>
>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>

Continuing the conversation about the above commit and adding Jesse.
I was trying to understand the problem that the commit message explains
"The default behavior of the kernel is somewhat undesirable as all
requested interrupts end up on CPU0 after registration." I have also been
trying to reproduce this behavior without the patch, but I failed to do so,
maybe because I am missing something here.

@Jesse Can you please explain? FWIU IRQ affinity should be decided based on
the default affinity mask.

The problem with the commit is that when we overwrite the affinity mask
based on the hinting mask, we completely ignore the default SMP affinity
mask. If we do want to overwrite the affinity based on the hint mask, we
should at least consider the default SMP affinity.

-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-11 Thread Nitesh Narayan Lal


On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>
>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>> Does sound like a nice idea to explore; let's see what Thomas thinks
>>> about it.



>>> When the affinity mask of the interrupt at the time when it is actually
>>> requested contains an isolated CPU then nothing prevents the kernel from
>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>> with that spreading thingy.
>>>
>>> The only difference which this change makes is the fact that the
>>> affinity hint changes. Nothing else.
>>>
> Thanks for the detailed explanation.
>
> Before I posted this patch, I was doing some debugging on a setup where I
> was observing some latency issues due to the iavf IRQs that were pinned on
> the isolated CPUs.
>
> Based on some initial traces I had this impression that the affinity hint
> or cpumask_local_spread was somehow playing a role in deciding the affinity
> mask of these IRQs. Although, that does look incorrect after going through
> your explanation.
> For some reason, with a kernel that had this patch when I tried creating
> VFs iavf IRQs always ended up on the HK CPUs.
>
> The reasoning for the above is still not very clear to me. I will investigate
> this further to properly understand this behavior.
>
>

After a little more digging, I found out why cpumask_local_spread change
affects the general/initial smp_affinity for certain device IRQs.

After the introduction of the commit:

    e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()

For all the drivers that set an affinity hint, the initial affinity is set
based on the CPU retrieved from cpumask_local_spread(). So in an environment
where irqbalance is disabled, these device IRQs remain on the CPUs picked by
cpumask_local_spread() even though those CPUs are isolated. I think the
commit message of the reverted patch should have covered this as well.

-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-06 Thread Nitesh Narayan Lal


On 2/5/21 5:23 PM, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>> Does sound like a nice idea to explore; let's see what Thomas thinks
>> about it.
> I just read back up on that whole discussion and stared into the usage
> sites a bit.
>
> There are a couple of issues here in a larger picture. Looking at it
> from the device side first:
>
> The spreading is done for non-managed queues/interrupts which makes them
> movable by user space. So it could be argued from both sides that the
> damage done by allowing the full online mask or by allowing only the
> house keeping mask can be fixed up by user space.
>
> But that's the trivial part of the problem. The real problem is CPU
> hotplug and offline CPUs and the way how interrupts are set up for their
> initial affinity.
>
> As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
> cpumask_local_spread to houskeeping CPUs") is broken as it can return
> offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

A quick question here: is there any reason why we use cpu_possible_mask
instead of cpu_online_mask in housekeeping_cpumask()?
(Not for this particular patch, but in general.)

>
> The original code is racy vs. hotplug unless the callers block hotplug.
>
> Let's look at all the callers and what they do with it.
>
>   cptvf_set_irq_affinity() affinity hint
>   safexcel_request_ring_irq()  affinity hint
>   mv_cesa_probe()  affinity hint
>   bnxt_request_irq()   affinity hint
>   nicvf_set_irq_affinity() affinity hint
>   cxgb4_set_msix_aff() affinity hint
>   enic_init_affinity_hint(()   affinity hint
>   iavf_request_traffic_irqs()  affinity hint
>   ionic_alloc_qcq_interrupt()  affinity hint
>   efx_set_interrupt_affinity() affinity hint
>   i40e_vsi_request_irq_msix()  affinity hint
>
>   be_evt_queues_create()   affinity hint, queue affinity
>   hns3_nic_set_cpumask()   affinity hint, queue affinity
>   mlx4_en_init_affinity_hint() affinity hint, queue affinity
>   mlx4_en_create_tx_ring() affinity hint, queue affinity
>   set_comp_irq_affinity_hint() affinity hint, queue affinity
>   i40e_config_xps_tx_ring()affinity hint, queue affinity
>   
>   hclge_configure  affinity_hint, queue affinity, workqueue 
> selection
>
>   ixgbe_alloc_q_vector()   node selection, affinity hint, queue affinity
>
> All of them do not care about disabling hotplug. Taking cpu_read_lock()
> inside of that spread function would not solve anything because once the
> lock is dropped the CPU can go away.
>
> There are 3 classes of this:
>
>1) Does not matter: affinity hint
>
>2) Might fail to set up the network queue when the selected CPU
>   is offline.
>
>3) Broken: The hclge driver which uses the cpu to schedule work on
>   that cpu. That's broken, but unfortunately neither the workqueue
>   code nor the timer code will ever notice. The work just wont be
>   scheduled until the CPU comes online again which might be never.

Agreed.

> But looking at the above I really have to ask the question what the
> commit in question is actually trying to solve.
>
> AFAICT, nothing at all. Why?
>
>   1) The majority of the drivers sets the hint __after_ requesting the
>  interrupt
>
>   2) Even if set _before_ requesting the interrupt it does not solve
>  anything because it's a hint and the interrupt core code does
>  not care about it at all. It provides the storage and the procfs
>  interface nothing else.
>
> So how does that prevent the interrupt subsystem from assigning an
> interrupt to an isolated CPU? Not at all.
>
> Interrupts which are freshly allocated get the default interrupt
> affinity mask, which is either set on the command line or via /proc. The
> affinity of the interrupt can be changed after it has been populated in
> /proc.
>
> When the interrupt is requested then one of the online CPUs in it's
> affinity mask is chosen.
>
> X86 is special here because this also requires that there are free
> vectors on one of the online CPUs in the mask. If the CPUs in the
> affinity mask run out of vectors then it will grab a vector from some
> other CPU which might be an isolated CPU.
>
> When the affi

Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Nitesh Narayan Lal


On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
> On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:

[...]

>>>>> Nitesh, is there anything preventing this from being fixed
>>>>> in userspace ? (as Thomas suggested previously).
>>>> Everything with is not managed can be steered by user space.
>>> Yes, but it seems to be racy (that is, there is a window where the 
>>> interrupt can be delivered to an isolated CPU).
>>>
>>> ethtool ->
>>> xgbe_set_channels ->
>>> xgbe_full_restart_dev ->
>>> xgbe_alloc_memory ->
>>> xgbe_alloc_channels ->
>>> cpumask_local_spread
>>>
>>> Also ifconfig eth0 down / ifconfig eth0 up leads
>>> to cpumask_local_spread.
>> There's always that possibility.
> Then there is a window where isolation can be broken.
>
>> We have to ensure that we move the IRQs by a tuned daemon or some other
>> userspace script every time there is a net-dev change (eg. device comes up,
>> creates VFs, etc).
> Again, race window open can result in interrupt to isolated CPU.


Yes, and if I am not mistaken, that has always been a problem with
these userspace solutions.

>
>>> How about adding a new flag for isolcpus instead?
>>>
>> Do you mean a flag based on which we can switch the affinity mask to
>> housekeeping for all the devices at the time of IRQ distribution?
> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>
>

Does sound like a nice idea to explore; let's see what Thomas thinks about it.

-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Nitesh Narayan Lal


On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
>> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
 The whole pile wants to be reverted. It's simply broken in several ways.
>>> I was asking for your comments on interaction with CPU hotplug :-)
>> Which I answered in an seperate mail :)
>>
>>> So housekeeping_cpumask has multiple meanings. In this case:
>> ...
>>
>>> So as long as the meaning of the flags are respected, seems
>>> alright.
>> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
>> when a affinity mask spawns housekeeping and isolated is perfectly
>> fine. It's well thought out and has no limitations.
>>
>>> Nitesh, is there anything preventing this from being fixed
>>> in userspace ? (as Thomas suggested previously).
>> Everything with is not managed can be steered by user space.
> Yes, but it seems to be racy (that is, there is a window where the 
> interrupt can be delivered to an isolated CPU).
>
> ethtool ->
> xgbe_set_channels ->
> xgbe_full_restart_dev ->
> xgbe_alloc_memory ->
> xgbe_alloc_channels ->
> cpumask_local_spread
>
> Also ifconfig eth0 down / ifconfig eth0 up leads
> to cpumask_local_spread.

There's always that possibility.

We have to ensure that the IRQs are moved by a tuned daemon or some other
userspace script every time there is a net-dev change (e.g., a device comes
up, creates VFs, etc.).

> How about adding a new flag for isolcpus instead?
>

Do you mean a flag based on which we can switch the affinity mask to
housekeeping for all the devices at the time of IRQ distribution?

-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-28 Thread Nitesh Narayan Lal


On 1/28/21 11:59 AM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
>> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> + mask = housekeeping_cpumask(hk_flags);
 AFAICS, this generally resolves to something based on cpu_possible_mask
 rather than cpu_online_mask as before, so could now potentially return an
 offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> The whole pile wants to be reverted. It's simply broken in several ways.
> I was asking for your comments on interaction with CPU hotplug :-)
> Anyway...
>
> So housekeeping_cpumask has multiple meanings. In this case:
>
> HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ
>
>  domain
>Isolate from the general SMP balancing and scheduling
>algorithms. Note that performing domain isolation this way
>is irreversible: it's not possible to bring back a CPU to
>the domains once isolated through isolcpus. It's strongly
>advised to use cpusets instead to disable scheduler load
>balancing through the "cpuset.sched_load_balance" file.
>It offers a much more flexible interface where CPUs can
>move in and out of an isolated set anytime.
>
>You can move a process onto or off an "isolated" CPU via
>the CPU affinity syscalls or cpuset.
>CPU numbering begins at 0 and the maximum value is
>"number of CPUs in system - 1".
>
>  managed_irq
>
>Isolate from being targeted by managed interrupts
>which have an interrupt mask containing isolated
>CPUs. The affinity of managed interrupts is
>handled by the kernel and cannot be changed via
>the /proc/irq/* interfaces.
>
>This isolation is best effort and only effective
>if the automatically assigned interrupt mask of a
>device queue contains isolated and housekeeping
>CPUs. If housekeeping CPUs are online then such
>interrupts are directed to the housekeeping CPU
>so that IO submitted on the housekeeping CPU
>cannot disturb the isolated CPU.
>
>If a queue's affinity mask contains only isolated
>CPUs then this parameter has no effect on the
>interrupt routing decision, though interrupts are
>only delivered when tasks running on those
>isolated CPUs submit IO. IO submitted on
>housekeeping CPUs has no influence on those
>queues.
>
> So as long as the meaning of the flags are respected, seems
> alright.
>
> Nitesh, is there anything preventing this from being fixed
> in userspace ? (as Thomas suggested previously).

I think it should be doable, at least for most of the devices.
However, I do wonder if there is a better way of fixing this generically
within the kernel?

Currently, as Thomas mentioned, housekeeping_cpumask() is used at different
locations just to fix the issue in the corresponding component or driver.

-- 
Thanks
Nitesh



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-27 Thread Nitesh Narayan Lal


On 1/27/21 8:09 AM, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
>> On 2021-01-27 12:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>>>> From: Alex Belits 
>>>>>
>>>>> The current implementation of cpumask_local_spread() does not respect the
>>>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>>>> it will return it to the caller for pinning of its IRQ threads. Having
>>>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>>>> overhead.
>>>>>
>>>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>>>> available housekeeping CPUs.
>>>>>
>>>>> Signed-off-by: Alex Belits 
>>>>> Signed-off-by: Nitesh Narayan Lal 
>>>>> ---
>>>>>lib/cpumask.c | 16 +++-
>>>>>1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>>>> index fb22fb266f93..85da6ab4fbb5 100644
>>>>> --- a/lib/cpumask.c
>>>>> +++ b/lib/cpumask.c
>>>>> @@ -6,6 +6,7 @@
>>>>>#include 
>>>>>#include 
>>>>>#include 
>>>>> +#include 
>>>>>/**
>>>>> * cpumask_next - get the next cpu in a cpumask
>>>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
>>>>> mask)
>>>>> */
>>>>>unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>>{
>>>>> - int cpu;
>>>>> + int cpu, hk_flags;
>>>>> + const struct cpumask *mask;
>>>>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> + mask = housekeeping_cpumask(hk_flags);
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
> Nothing, but that was the situation before 
> 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.

Marcelo, before that commit, cpumask_local_spread() was in fact relying on
cpu_online_mask, as Robin mentioned.
The problem here is with housekeeping_cpumask(), which has always relied on
cpu_possible_mask.
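
A toy sketch of the filtering in question, treating cpumasks as plain
bitmasks (the mask values are made up for illustration): AND-ing the
possible-mask-based housekeeping mask with the online mask is what drops
offline CPUs before spreading.

```shell
# Illustrative masks: 4 possible CPUs, CPU 3 currently offline.
possible_mask=$(( 0xF ))                        # housekeeping_cpumask() result
online_mask=$(( 0x7 ))                          # cpu_online_mask
spread_mask=$(( possible_mask & online_mask ))  # cpumask_and() equivalent
printf 'mask used for spreading: 0x%x\n' "$spread_mask"
```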

>
> cpumask_local_spread() should probably be disabling CPU hotplug.


Yes, and this should also be done at several other places in the drivers
which don't take CPU hotplug into account, e.g. at the time of vector
allocation.


-- 
Thanks
Nitesh



Re: [PATCH] Revert "KVM: x86: Unconditionally enable irqs in guest context"

2021-01-12 Thread Nitesh Narayan Lal


On 1/7/21 4:33 AM, Vitaly Kuznetsov wrote:
> Sean Christopherson  writes:
>
>> On Wed, Jan 06, 2021, Vitaly Kuznetsov wrote:
>>> Looking back, I don't quite understand why we wanted to account ticks
>>> between vmexit and exiting guest context as 'guest' in the first place;
>>> to my understanging 'guest time' is time spent within VMX non-root
>>> operation, the rest is KVM overhead (system).
>> With tick-based accounting, if the tick IRQ is received after PF_VCPU is 
>> cleared
>> then that tick will be accounted to the host/system.  The motivation for 
>> opening
>> an IRQ window after VM-Exit is to handle the case where the guest is 
>> constantly
>> exiting for a different reason _just_ before the tick arrives, e.g. if the 
>> guest
>> has its tick configured such that the guest and host ticks get synchronized
>> in a bad way.
>>
>> This is a non-issue when using CONFIG_VIRT_CPU_ACCOUNTING_GEN=y, at least 
>> with a
>> stable TSC, as the accounting happens during guest_exit_irqoff() itself.
>> Accounting might be less-than-stellar if TSC is unstable, but I don't think 
>> it
>> would be as binary of a failure as tick-based accounting.
>>
> Oh, yea, I vaguely remember we had to deal with a very similar problem
> but for userspace/kernel accounting. It was possible to observe e.g. a
> userspace task going 100% kernel while in reality it was just perfectly
> synchronized with the tick and doing a syscall just before it arrives
> (or something like that, I may be misremembering the details).
>
> So depending on the frequency, it is probably possible to e.g observe
> '100% host' with tick based accounting, the guest just has to
> synchronize exiting to KVM in a way that the tick will always arrive
> past guest_exit_irqoff().
>
> It seems to me this is a fundamental problem in case the frequency of
> guest exits can match the frequency of the time accounting tick.
>

Just to make sure that I am understanding things correctly.
There are two issues:
1. The first issue is with the tick IRQs that arrive after PF_VCPU is
   cleared, as they are then accounted into the system context, at least on
   setups where CONFIG_VIRT_CPU_ACCOUNTING_GEN is not enabled. With the
   patch "KVM: x86: Unconditionally enable irqs in guest context", we at
   least take care of the scenario where the guest context is exiting
   constantly just before the arrival of the tick.
 
2. The second issue that Sean mentioned was introduced by moving
   guest_exit_irqoff() closer to the VM-exit. Due to this change, any ticks
   that happen after IRQs are disabled are incorrectly accounted into the
   system context. This is because we exit the guest context early without
   ensuring that the guest state required to handle IRQs is restored.
 
So, the increase in the system time (reported by cpuacct.stats) that I was
observing is not entirely correct after all.
Am I missing anything here?
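
A toy model of the tick-accounting rule behind both issues (purely
illustrative): a tick is charged to the guest only if PF_VCPU is still set
when it arrives, so any tick landing after guest_exit_irqoff() has cleared
the flag goes to the system context.

```shell
# PF_VCPU has already been cleared by guest_exit_irqoff() when the tick lands.
pf_vcpu=0
if [ "$pf_vcpu" -eq 1 ]; then
    charged_to="guest"
else
    charged_to="system"
fi
echo "tick charged to: $charged_to"
```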

--
Thanks
Nitesh



Re: [PATCH] Revert "KVM: x86: Unconditionally enable irqs in guest context"

2021-01-05 Thread Nitesh Narayan Lal


On 1/5/21 7:47 PM, Sean Christopherson wrote:
> +tglx
>
> On Tue, Jan 05, 2021, Nitesh Narayan Lal wrote:
>> This reverts commit d7a08882a0a4b4e176691331ee3f492996579534.
>>
>> After the introduction of the patch:
>>
>>  87fa7f3e9: x86/kvm: Move context tracking where it belongs
>>
>> since we have moved guest_exit_irqoff closer to the VM-Exit, explicit
>> enabling of irqs to process pending interrupts should not be required
>> within vcpu_enter_guest anymore.
> Ugh, except that commit completely broke tick-based accounting, on both Intel
> and AMD.

I did notice some discrepancies in the system time reported after the
introduction of this patch, but I wrongly concluded that the behavior was
correct.

I reported this yesterday [1] but I think I added your old email ID in
that thread (sorry about that).

>   With guest_exit_irqoff() being called immediately after VM-Exit, any
> tick that happens after IRQs are disabled will be accounted to the host.  E.g.
> on Intel, even an IRQ VM-Exit that has already been acked by the CPU isn't
> processed until kvm_x86_ops.handle_exit_irqoff(), well after PF_VCPU has been
> cleared.

Right that also explains the higher system time reported by the cpuacct.stats.

>
> CONFIG_VIRT_CPU_ACCOUNTING_GEN=y should still work (I didn't bother to 
> verify).

For the cpuacct stats that I have shared in the other thread, this config was
enabled.

>
> Thomas, any clever ideas?  Handling IRQs in {vmx,svm}_vcpu_enter_exit() isn't 
> an
> option as KVM hasn't restored enough state to handle an IRQ, e.g. PKRU and 
> XCR0
> are still guest values.  Is it too heinous to fudge PF_VCPU across KVM's
> "pending" IRQ handling?  E.g. this god-awful hack fixes the accounting:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 836912b42030..5a777fd35b4b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9028,6 +9028,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> vcpu->mode = OUTSIDE_GUEST_MODE;
> smp_wmb();
>  
> +   current->flags |= PF_VCPU;
> kvm_x86_ops.handle_exit_irqoff(vcpu);
>  
> /*
> @@ -9042,6 +9043,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> ++vcpu->stat.exits;
> local_irq_disable();
> kvm_after_interrupt(vcpu);
> +   current->flags &= ~PF_VCPU;
>  
> if (lapic_in_kernel(vcpu)) {
> s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
>

I can give this a try.
What is the right way to test this (via cpuacct stats maybe)?

[1] 
https://lore.kernel.org/lkml/12a1b9d4-8534-e23a-6bbd-736474928...@redhat.com/
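
For reference, a sketch of how I read the numbers (the cgroup path is an
assumption, and the sample content just mirrors the values from my earlier
report):

```shell
# On a real host the content would come from the VM's cgroup, e.g.
# /sys/fs/cgroup/cpuacct/machine.slice/<vm>/cpuacct.stat (path assumed).
stat='user 471
system 6094'
system=$(printf '%s\n' "$stat" | awk '$1 == "system" { print $2 }')
echo "system time: $system"
```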

-- 
Thanks
Nitesh



[PATCH] Revert "KVM: x86: Unconditionally enable irqs in guest context"

2021-01-05 Thread Nitesh Narayan Lal
This reverts commit d7a08882a0a4b4e176691331ee3f492996579534.

After the introduction of the patch:

87fa7f3e9: x86/kvm: Move context tracking where it belongs

since we have moved guest_exit_irqoff closer to the VM-Exit, explicit
enabling of irqs to process pending interrupts should not be required
within vcpu_enter_guest anymore.

Conflicts:
arch/x86/kvm/svm.c

Signed-off-by: Nitesh Narayan Lal 
---
 arch/x86/kvm/svm/svm.c |  9 +
 arch/x86/kvm/x86.c | 11 ---
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index cce0143a6f80..c9b2fbb32484 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4187,6 +4187,15 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
 
 static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
+   kvm_before_interrupt(vcpu);
+   local_irq_enable();
+   /*
+* We must have an instruction with interrupts enabled, so
+* the timer interrupt isn't delayed by the interrupt shadow.
+*/
+   asm("nop");
+   local_irq_disable();
+   kvm_after_interrupt(vcpu);
 }
 
 static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3f7c1fc7a3ce..3e17c9ffcad8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9023,18 +9023,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
kvm_x86_ops.handle_exit_irqoff(vcpu);
 
-   /*
-* Consume any pending interrupts, including the possible source of
-* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
-* An instruction is required after local_irq_enable() to fully unblock
-* interrupts on processors that implement an interrupt shadow, the
-* stat.exits increment will do nicely.
-*/
-   kvm_before_interrupt(vcpu);
-   local_irq_enable();
++vcpu->stat.exits;
-   local_irq_disable();
-   kvm_after_interrupt(vcpu);
 
if (lapic_in_kernel(vcpu)) {
s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
-- 
2.27.0



Possible regression in cpuacct.stats system time

2021-01-04 Thread Nitesh Narayan Lal
Hi,

Last year I reported an issue of "suspicious RCU usage" [1] with the debug
kernel which was fixed with the patch:

    87fa7f3e98 "x86/kvm: Move context tracking where it belongs"

Recently I have come across a possible regression because of this
patch in the cpuacct.stats system time.

With the latest upstream kernel (5.11-rc2), when we set up a VM and start
observing the system time value from cpuacct.stat, it is significantly
higher than the value reported with a kernel that doesn't have the
previously mentioned patch.

For instance, the following are the values of cpuacct.stats right after the
VM bring up completion for two cases:

with a kernel that has the patch-
    user 471
    system 6094

with the patch reverted-
    user 498
    system 1873


FWIU the reason behind this increase is the moving of guest_exit_irqoff()
to its proper location (near VM-exit). This leads to instructions that
were previously accounted into the guest context now being accounted as
part of the system time.

IMO this should be the expected behavior after the previously mentioned
change. Is that the right conclusion, or am I missing something here?

Another question that I have is about the patch

    d7a08882a0 "KVM: x86: Unconditionally enable irqs in guest context"

considering we are enabling irqs early now in the code path, do we still
need this patch?


[1] 
https://lore.kernel.org/lkml/ece36eb1-253a-8ec6-c183-309c10bb3...@redhat.com/

--
Thanks
Nitesh



Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-26 Thread Nitesh Narayan Lal

On 10/26/20 5:50 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
 Are there drivers which use more than one interrupt per queue? I know
 drivers have multiple management interrupts.. and I guess some drivers
 do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to
 have multiple queues for one interrupt .. I'm not sure how a single
 queue with multiple interrupts would work though.
>>> For block there is always one interrupt per queue. Some network drivers
>>> seem to have separate RX and TX interrupts per queue.
>> That's true when thinking of Tx and Rx as a single queue. Another way to
>> think about it is "one rx queue" and "one tx queue" each with their own
>> interrupt...
>>
>> Even if there are devices which force there to be exactly queue pairs,
>> you could still think of them as separate entities?
> Interesting thought.
>
> But as Jakub explained networking queues are fundamentally different
> from block queues on the RX side. For block the request issued on queue
> X will raise the complete interrupt on queue X.
>
> For networking the TX side will raise the TX interrupt on the queue on
> which the packet was queued obviously or should I say hopefully. :)

This is my impression as well.

> But incoming packets will be directed to some receive queue based on a
> hash or whatever crystallball logic the firmware decided to implement.
>
> Which makes this not really suitable for the managed interrupt and
> spreading approach which is used by block-mq. Hrm...
>
> But I still think that for curing that isolation stuff we want at least
> some information from the driver. Alternative solution would be to grant
> the allocation of interrupts and queues and have some sysfs knob to shut
> down queues at runtime. If that shutdown results in releasing the queue
> interrupt (via free_irq()) then the vector exhaustion problem goes away.

I think this is close to what I and Marcelo were discussing earlier today
privately.

I don't think there is currently a way to control the enabling/disabling
of interrupts from userspace.

I think in terms of the idea we need something similar to what i40e does,
that is, shutting down all IRQs when a CPU is suspended and restoring the
interrupt scheme when the CPU is back online.

The two key differences will be that this API needs to be generic and also
needs to be exposed to userspace through something like sysfs, as you
have mentioned.

-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-26 Thread Nitesh Narayan Lal

On 10/23/20 5:00 PM, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
>> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>>> So shouldn't we then fix the drivers / interface first, to get rid of
>>> this inconsistency?
>>>
>> Considering we agree that excess vector is a problem that needs to be
>> solved across all the drivers and that you are comfortable with the other
>> three patches in the set. If I may suggest the following:
>>
>> - We can pick those three patches for now, as that will at least fix a
>>   driver that is currently impacting RT workloads. Is that a fair
>>   expectation?
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
> what the right number of interrupts for this situation is.
>
> Many of these drivers need more than queue interrupts, admin, error
> interrupt and some operate best with separate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
>
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.

Does it somehow take num_online_cpus() into consideration while deciding
the number of interrupts to create?

>
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the points that it breaks a carefully crafted setup.
>
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!
>
> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.

I agree that does sound wrong.

>
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.

Indeed.
I think one commonality among the drivers at the moment is the usage of
num_online_cpus() to determine the number of vectors to create.

So, maybe instead of imposing this kind of restriction in a generic-level
API, it will make more sense to do it on a per-device basis by replacing
the number of online CPUs with the housekeeping CPUs?

This is what I have done in the i40e patch.
But that still sounds hackish and will impact performance.
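
A sketch of that per-device clamp (all counts and the helper names are
illustrative assumptions, not the actual i40e code):

```shell
# Driver asks for one vector per online CPU; clamp to housekeeping CPUs.
min_vecs=1
max_vecs=12   # stand-in for num_online_cpus()
hk_cpus=4     # stand-in for the housekeeping CPU count, 8 CPUs isolated

# clamp(hk_cpus, min_vecs, max_vecs)
vecs=$hk_cpus
if [ "$vecs" -lt "$min_vecs" ]; then vecs=$min_vecs; fi
if [ "$vecs" -gt "$max_vecs" ]; then vecs=$max_vecs; fi
echo "vectors allocated: $vecs"
```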

>
> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.

Right, FWIU it is irq_do_set_affinity() that prevents the spreading of
managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled,
isn't it?

>
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. All isolated workers can do the
> same in parallel without trampling on each other toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
>
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly part
> of it which can be reused.

So this is one of the areas that I don't understand well yet and will
explore now.

>
> This needs a lot more thought than just these crude hacks.

Got it.
I am indeed not in the favor of pushing a solution that has a possibility
of regressing/breaking things afterward.

>
> Especially under the aspect that there are talks about making isolation
> runtime switchable. Are you going to rmmod/insmod the i40e network
> driver to do so? That's going to work fine if you do that
> reconfiguration over network...

That's an interesting point. However, for some of the drivers which use
something like cpu_online/possible_mask while creating their affinity mask,
reloading will still associate jobs to the isolated CPUs.


-- 
Thanks
Nitesh





Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-10-23 Thread Nitesh Narayan Lal

On 10/23/20 9:29 AM, Frederic Weisbecker wrote:
> On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>>> Extend nohz_full feature set to include isolation from managed IRQS. This
>> So you say it's for managed-irqs, the feature is actually called
>> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.
>>
>> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
>> fine and don't need help at at...
>>
>>> is required specifically for setups that only uses nohz_full and still
>>> requires isolation for maintaining lower latency for the listed CPUs.
>>>
>>> Suggested-by: Frederic Weisbecker 
> Ah and yes there is this tag :-p
>
> So that's my bad, I really thought this thing was about managed IRQ.
> The problem is that I can't find a single documentation about them so I'm
> too clueless on that matter.

I am also confused by this terminology.
So my bad for not taking care of this.

-- 
Thanks
Nitesh





Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-10-23 Thread Nitesh Narayan Lal

On 10/23/20 9:25 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>> Extend nohz_full feature set to include isolation from managed IRQS. This
> So you say it's for managed-irqs, the feature is actually called
> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Ah, my bad! I should replace "managed IRQs" with MANAGED_IRQ.
I can send another version with this fixed.

>
> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
> fine and don't need help at at...

Since the introduction of
"genirq, sched/isolation: Isolate from handling managed interrupts"

Within irq_do_set_affinity(), it is ensured that for managed interrupts as
well, the isolated CPUs are removed from the affinity mask.

Hence, IMHO before this change managed interrupts were affecting the
isolated CPUs.

My intent of having this change is to basically allow isolation for
nohz_full CPUs even when we don't have something like isolcpus.
Does that make sense?


>
>> is required specifically for setups that only uses nohz_full and still
>> requires isolation for maintaining lower latency for the listed CPUs.
>>
>> Suggested-by: Frederic Weisbecker 
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  kernel/sched/isolation.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 5a6ea03f9882..9df9598a9e39 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>>  unsigned int flags;
>>  
>>  flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
>> -HK_FLAG_MISC | HK_FLAG_KTHREAD;
>> +HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>>  
>>  return housekeeping_setup(str, flags);
>>  }
>> -- 
>> 2.18.2
>>
-- 
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-23 Thread Nitesh Narayan Lal

On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>
>> Hi Peter,
>>
>> So based on the suggestions from you and Thomas, I think something like the
>> following should do the job within pci_alloc_irq_vectors_affinity():
>>
>> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
>> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
>>
>> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
>> and to an extent I agree that it does degrade the code clarity.
> It's not just code clarity; I simply don't understand it. It feels like
> a band-aid that breaks thing.
>
> At the very least it needs a ginormous (and coherent) comment that
> explains:
>
>  - the interface
>  - the usage
>  - this hack

That make sense.

>
>> However, since there is a certain inconsistency in the number of vectors
>> that drivers request through this API, IMHO we will need this; otherwise
>> we could impact the drivers even in setups that don't
>> have any isolated CPUs.
> So shouldn't we then fix the drivers / interface first, to get rid of
> this inconsistency?
>

Considering we agree that excess vectors are a problem that needs to be
solved across all the drivers, and that you are comfortable with the other
three patches in the set, may I suggest the following:

- We can pick those three patches for now, as that will at least fix a
  driver that is currently impacting RT workloads. Is that a fair
  expectation?

- In the meanwhile, I will start looking into the individual drivers that
  consume this API to find out if there is a correlation that can be
  derived between max_vecs and the number of CPUs. If one exists, then I
  can go ahead and tweak the API's max_vecs accordingly. However, if this
  is absolutely random then I can come up with a sane comment
  before this check that covers the list of items you suggested.

- I also want to explore the comments made by Thomas which may take
  some time.


-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-22 Thread Nitesh Narayan Lal

On 10/20/20 10:39 AM, Nitesh Narayan Lal wrote:
> On 10/20/20 9:41 AM, Peter Zijlstra wrote:
>> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>>> So I think it is important to figure out what that driver really wants
>>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>>> @max_vecs to something that expresses the N*nr_cpus relation.
>>> Reading Marcelo's comment again I think what is really expected is 1
>>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
>> Then what is the point of them asking for N*nr_cpus when there is no
>> isolation?
>>
>> Either everybody wants 1 interrupts per CPU and we can do the clamp
>> unconditionally, in which case we should go fix this user, or they want
>> multiple per cpu and we should go fix the interface.
>>
>> It cannot be both.
> Based on my understanding I don't think this is consistent; the number
> of interrupts any driver can request varies to the extent that some
> consumers of this API even request just one interrupt for their use.
>
> This was one of the reasons why I thought of having a conditional
> restriction.
>
> But I agree there is a lack of consistency.
>

Hi Peter,

So based on the suggestions from you and Thomas, I think something like the
following should do the job within pci_alloc_irq_vectors_affinity():

+       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
and to an extent I agree that it does degrade the code clarity.

However, since there is a certain inconsistency in the number of vectors
that drivers request through this API, IMHO we will need this; otherwise
we could impact the drivers even in setups that don't have any isolated
CPUs.

If you agree, I can send the next version of the patch-set.

-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-21 Thread Nitesh Narayan Lal

On 10/21/20 4:25 PM, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>>> However, IMHO we would still need a logic to prevent the devices from
>>> creating excess vectors.
>> Managed interrupts are preventing exactly that by pinning the interrupts
>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>> CPU hotplug.
>>
>> Non-managed, yes that is and always was a problem. One of the reasons
>> why managed interrupts exist.
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
>
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.

Yes, that is indeed the case.

>
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
>
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
>
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

IIRC, i40e does have something like that: it suspends all IRQs
before hibernation and restores them when the CPU comes back online.

I am not particularly sure about the other drivers.

This brings me to another discussion that Peter initiated, which is to
perform the proposed restriction unconditionally for all non-managed
IRQs.

Something along the lines of:

+   if (!pci_is_managed(dev))
+   max_vecs = clamp(hk_cpus, min_vecs, max_vecs);


I am not particularly sure about this, because I don't know what kind of
performance penalty it would impose on the drivers in general and whether
that would be acceptable at all. Any thoughts?

However, this still doesn't solve the generic problem; an ideal solution
would be something along the lines of what you suggested.

Would it be sensible to have a generic API that can be consumed by all the
drivers and that can do both of the things you mentioned?
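For illustration, here is a rough user-space model of that unconditional
restriction; pci_is_managed(), the housekeeping CPU count, and clamp() are
all replaced with plain C stand-ins, so this is only a sketch of the
proposed behavior, not the real PCI core:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's clamp(): bound val to [lo, hi]. */
static unsigned int clamp_uint(unsigned int val, unsigned int lo,
			       unsigned int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Model of the proposed unconditional restriction: managed interrupts are
 * left alone (they are pinned per CPU anyway), everything else is clamped
 * to the number of housekeeping CPUs.
 */
static unsigned int limit_non_managed(bool is_managed, unsigned int hk_cpus,
				      unsigned int min_vecs,
				      unsigned int max_vecs)
{
	if (!is_managed)
		max_vecs = clamp_uint(hk_cpus, min_vecs, max_vecs);
	return max_vecs;
}
```

With 4 housekeeping CPUs out of 16 online, a non-managed request for up to
16 vectors would be reduced to 4, while a managed request is left
untouched.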

>
> Thanks,
>
> tglx
>
>
>
-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-20 Thread Nitesh Narayan Lal

On 10/20/20 10:16 AM, Thomas Gleixner wrote:
> On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>>  
>> +hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +/*
>> + * If we have isolated CPUs for use by real-time tasks, to keep the
>> + * latency overhead to a minimum, device-specific IRQ vectors are moved
>> + * to the housekeeping CPUs from the userspace by changing their
>> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> + * running out of IRQ vectors.
>> + */
> This is not true for managed interrupts. The interrupts affinity of
> those cannot be changed by user space.

Ah, yes. Perhaps
s/IRQs/non-managed IRQs/?

>
>> +if (hk_cpus < num_online_cpus()) {
>> +if (hk_cpus < min_vecs)
>> +max_vecs = min_vecs;
>> +else if (hk_cpus < max_vecs)
>> +max_vecs = hk_cpus;
>> +}
> So now with that assume a 16 core machine (HT off for simplicity)
>
> 17 Requested interrupts (1 general, 16 queues)
>
> Managed interrupts will allocate
>
>1  general interrupt which is free movable by user space
>16 managed interrupts for queues (one per CPU)
>
> This allows the driver to have 16 queues, i.e. one queue per CPU. These
> interrupts are only used when an application on a CPU issues I/O.

Correct.

>
> With the above change this will result
>
>1  general interrupt which is free movable by user space
>1  managed interrupts (possible affinity to all 16 CPUs, but routed
>   to housekeeping CPU as long as there is one online)
>
> So the device is now limited to a single queue which also affects the
> housekeeping CPUs because now they have to share a single queue.
>
> With larger machines this gets even worse.

Yes, the change can impact performance; however, if we don't do it we may
have a latency impact instead, specifically on larger systems where most of
the CPUs are isolated, since we will certainly fail to move all of the IRQs
away from the isolated CPUs to the housekeeping CPUs.

>
> So no. This needs way more thought for managed interrupts and you cannot
> do that at the PCI layer.

Maybe we should not be doing anything in the case of managed IRQs, as they
are anyway pinned to the housekeeping CPUs as long as we have the
'managed_irq' option included in the kernel cmdline.

>  Only the affinity spreading mechanism can do
> the right thing here.

I can definitely explore this further.

However, IMHO we would still need logic to prevent the devices from
creating excess vectors.

Do you agree?

>
> Thanks,
>
> tglx
>
-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-20 Thread Nitesh Narayan Lal

On 10/20/20 9:41 AM, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>> So I think it is important to figure out what that driver really wants
>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>> @max_vecs to something that expresses the N*nr_cpus relation.
>> Reading Marcelo's comment again I think what is really expected is 1
>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
> Then what is the point of them asking for N*nr_cpus when there is no
> isolation?
>
> Either everybody wants 1 interrupts per CPU and we can do the clamp
> unconditionally, in which case we should go fix this user, or they want
> multiple per cpu and we should go fix the interface.
>
> It cannot be both.

Based on my understanding, this is not consistent: the number of interrupts
a driver can request varies to the extent that some consumers of this API
request just one interrupt for their use.

This was one of the reasons why I thought of having a conditional
restriction.

But I agree there is a lack of consistency.

-- 
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-20 Thread Nitesh Narayan Lal

On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>> So I think it is important to figure out what that driver really wants
>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>> only reduce the number of CPUs, the proposed interface is wrong.
>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Then the patch is wrong and the interface needs changing from @min_vecs,
> @max_vecs to something that expresses the N*nr_cpus relation.

Reading Marcelo's comment again I think what is really expected is 1
interrupt per non-isolated (housekeeping) CPU (not N interrupts).

My bad for missing it initially.

-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-19 Thread Nitesh Narayan Lal

On 10/19/20 10:00 AM, Marcelo Tosatti wrote:
> On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
>> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:

[...]

>>>> Also, do we really need to have that conditional on hk_cpus <
>>>> num_online_cpus()? That is, why can't we do this unconditionally?
>>> FWIU most of the drivers using this API already restricts the number of
>>> vectors based on the num_online_cpus, if we do it unconditionally we can
>>> unnecessary duplicate the restriction for cases where we don't have any
>>> isolated CPUs.
>> unnecessary isn't really a concern here, this is a slow path. What's
>> important is code clarity.

Right, I can skip that check then.

>>
>>> Also, different driver seems to take different factors into consideration
>>> along with num_online_cpus while finding the max_vecs to request, for
>>> example in the case of mlx5:
>>> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>>>    MLX5_EQ_VEC_COMP_BASE
>>>
>>> Having hk_cpus < num_online_cpus() helps us ensure that we are only
>>> changing the behavior when we have isolated CPUs.
>>>
>>> Does that make sense?
>> That seems to want to allocate N interrupts per cpu (plus some random
>> static amount, which seems weird, but whatever). This patch breaks that.
> On purpose. For the isolated CPUs we don't want network device 
> interrupts (in this context).
>
>> So I think it is important to figure out what that driver really wants
>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>> only reduce the number of CPUs, the proposed interface is wrong.
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Zero interrupts for isolated interrupts.

Right, otherwise we may end up in a situation where we run out of per-CPU
vectors while moving the IRQs from the isolated CPUs to the housekeeping
CPUs.


-- 
Thanks
Nitesh





Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-18 Thread Nitesh Narayan Lal

On 10/16/20 8:20 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  drivers/pci/msi.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 30ae4ffda5c1..8c156867803c 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -23,6 +23,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include "pci.h"
>>  
>> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev 
>> *dev, unsigned int min_vecs,
>> struct irq_affinity *affd)
>>  {
>>  struct irq_affinity msi_default_affd = {0};
>> +unsigned int hk_cpus;
>>  int nvecs = -ENOSPC;
>>  
>> +hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +/*
>> + * If we have isolated CPUs for use by real-time tasks, to keep the
>> + * latency overhead to a minimum, device-specific IRQ vectors are moved
>> + * to the housekeeping CPUs from the userspace by changing their
>> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> + * running out of IRQ vectors.
>> + */
>> +if (hk_cpus < num_online_cpus()) {
>> +if (hk_cpus < min_vecs)
>> +max_vecs = min_vecs;
>> +else if (hk_cpus < max_vecs)
>> +max_vecs = hk_cpus;
> is that:
>
>   max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Yes, I think this will do.
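For what it's worth, the if/else ladder and the suggested clamp() produce
the same result whenever min_vecs <= max_vecs. A quick user-space check,
with clamp() open-coded since this runs outside the kernel:

```c
#include <assert.h>

/* The if/else ladder from the patch, modeled in user space. */
static unsigned int ladder(unsigned int hk_cpus, unsigned int online,
			   unsigned int min_vecs, unsigned int max_vecs)
{
	if (hk_cpus < online) {
		if (hk_cpus < min_vecs)
			max_vecs = min_vecs;
		else if (hk_cpus < max_vecs)
			max_vecs = hk_cpus;
	}
	return max_vecs;
}

/* The same limit written as clamp(hk_cpus, min_vecs, max_vecs). */
static unsigned int clamped(unsigned int hk_cpus, unsigned int online,
			    unsigned int min_vecs, unsigned int max_vecs)
{
	if (hk_cpus < online) {
		unsigned int v = hk_cpus;

		if (v < min_vecs)
			v = min_vecs;
		else if (v > max_vecs)
			v = max_vecs;
		max_vecs = v;
	}
	return max_vecs;
}
```

An exhaustive sweep over small parameter values (with min_vecs <= max_vecs)
confirms the two forms agree.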

>
> Also, do we really need to have that conditional on hk_cpus <
> num_online_cpus()? That is, why can't we do this unconditionally?


FWIU, most of the drivers using this API already restrict the number of
vectors based on num_online_cpus(); if we do it unconditionally, we would
unnecessarily duplicate the restriction for cases where we don't have any
isolated CPUs.

Also, different drivers seem to take different factors into consideration
along with num_online_cpus() when computing the max_vecs to request; for
example, in the case of mlx5:
MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
   MLX5_EQ_VEC_COMP_BASE

Having hk_cpus < num_online_cpus() helps us ensure that we are only
changing the behavior when we have isolated CPUs.

Does that make sense?

>
> And what are the (desired) semantics vs hotplug? Using a cpumask without
> excluding hotplug is racy.

The housekeeping mask should still remain constant, shouldn't it?
In any case, I can double-check this.

>
>> +}
>> +
>>  if (flags & PCI_IRQ_AFFINITY) {
>>  if (!affd)
>>  affd = &msi_default_affd;
>> -- 
>> 2.18.2
>>
-- 
Thanks
Nitesh





Re: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs

2020-10-08 Thread Nitesh Narayan Lal

On 10/1/20 11:49 AM, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
>> Nitesh Narayan Lal (4):
>>   sched/isolation: API to get number of housekeeping CPUs
>>   sched/isolation: Extend nohz_full to isolate managed IRQs
>>   i40e: Limit msix vectors to housekeeping CPUs
>>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>>
>>  drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
>>  drivers/pci/msi.c   | 18 ++
>>  include/linux/sched/isolation.h |  9 +
>>  kernel/sched/isolation.c|  2 +-
>>  4 files changed, 30 insertions(+), 2 deletions(-)
> Acked-by: Frederic Weisbecker 
>
> Peter, if you're ok with the set, I guess this should go through
> the scheduler tree?
>
> Thanks.

Hi Peter,

I wanted to follow up and check whether you have any concerns/suggestions
that I should address in this patch-set before it can be picked up.

-- 
Thanks
Nitesh





Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-05 Thread Nitesh Narayan Lal

On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 02:44:39PM +, Alex Belits wrote:
>> On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
>>> External Email
>>>
>>> ---
>>> ---
>>> On Wed, Jul 22, 2020 at 02:49:49PM +, Alex Belits wrote:
 +/*
 + * Description of the last two tasks that ran isolated on a given
 CPU.
 + * This is intended only for messages about isolation breaking. We
 + * don't want any references to actual task while accessing this
 from
 + * CPU that caused isolation breaking -- we know nothing about
 timing
 + * and don't want to use locking or RCU.
 + */
 +struct isol_task_desc {
 +  atomic_t curr_index;
 +  atomic_t curr_index_wr;
 +  boolwarned[2];
 +  pid_t   pid[2];
 +  pid_t   tgid[2];
 +  charcomm[2][TASK_COMM_LEN];
 +};
 +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
>>> So that's quite a huge patch that would have needed to be split up.
>>> Especially this tracing engine.
>>>
>>> Speaking of which, I agree with Thomas that it's unnecessary. It's
>>> too much
>>> code and complexity. We can use the existing trace events and perform
>>> the
>>> analysis from userspace to find the source of the disturbance.
>> The idea behind this is that isolation breaking events are supposed to
>> be known to the applications while applications run normally, and they
>> should not require any analysis or human intervention to be handled.
> Sure but you can use trace events for that. Just trace interrupts, workqueues,
> timers, syscalls, exceptions and scheduler events and you get all the local
> disturbance. You might want to tune a few filters but that's pretty much it.
>
> As for the source of the disturbances, if you really need that information,
> you can trace the workqueue and timer queue events and just filter those that
> target your isolated CPUs.
>

I agree that we can do all those things with tracing.
However, IMHO having a simplified logging mechanism to gather the source of
a violation may help reduce the manual effort.

That said, I am not sure how easy it will be to maintain such an interface
over time.

--
Thanks
Nitesh





[PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-09-28 Thread Nitesh Narayan Lal
If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal 
---
 drivers/pci/msi.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 30ae4ffda5c1..8c156867803c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pci.h"
 
@@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
unsigned int min_vecs,
   struct irq_affinity *affd)
 {
struct irq_affinity msi_default_affd = {0};
+   unsigned int hk_cpus;
int nvecs = -ENOSPC;
 
+   hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+
+   /*
+* If we have isolated CPUs for use by real-time tasks, to keep the
+* latency overhead to a minimum, device-specific IRQ vectors are moved
+* to the housekeeping CPUs from the userspace by changing their
+* affinity mask. Limit the vector usage to keep housekeeping CPUs from
+* running out of IRQ vectors.
+*/
+   if (hk_cpus < num_online_cpus()) {
+   if (hk_cpus < min_vecs)
+   max_vecs = min_vecs;
+   else if (hk_cpus < max_vecs)
+   max_vecs = hk_cpus;
+   }
+
if (flags & PCI_IRQ_AFFINITY) {
if (!affd)
affd = &msi_default_affd;
-- 
2.18.2



[PATCH v4 1/4] sched/isolation: API to get number of housekeeping CPUs

2020-09-28 Thread Nitesh Narayan Lal
Introduce a new API housekeeping_num_online_cpus(), that can be used to
retrieve the number of online housekeeping CPUs based on the housekeeping
flag passed by the caller.

Some of the consumers for this API are the device drivers that were
previously relying only on num_online_cpus() to determine the number of
MSIX vectors to create. In real-time environments, to minimize interruptions
to isolated CPUs, all device-specific IRQ vectors are often moved to the
housekeeping CPUs; having excess vectors could cause a housekeeping CPU to
run out of IRQ vectors.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/sched/isolation.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..e021b1846c1d 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -57,4 +57,13 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags 
flags)
return true;
 }
 
+static inline unsigned int housekeeping_num_online_cpus(enum hk_flags flags)
+{
+#ifdef CONFIG_CPU_ISOLATION
+   if (static_branch_unlikely(&housekeeping_overridden))
+   return cpumask_weight(housekeeping_cpumask(flags));
+#endif
+   return num_online_cpus();
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
-- 
2.18.2
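As a user-space sketch of the fallback behavior introduced above, the
static branch and cpumask can be modeled with a bool and a plain bitmask;
the names and the fixed online-CPU count below are illustrative stand-ins,
not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_ONLINE_CPUS 16 /* stand-in for num_online_cpus() */

/*
 * Model of housekeeping_num_online_cpus(): when isolation has been
 * overridden on the command line, count the bits in the housekeeping
 * cpumask (as cpumask_weight() does); otherwise fall back to reporting
 * all online CPUs.
 */
static unsigned int hk_num_online(bool isolation_overridden,
				  unsigned long hk_mask)
{
	if (isolation_overridden)
		return (unsigned int)__builtin_popcountl(hk_mask);
	return NUM_ONLINE_CPUS;
}
```

So on a 16-CPU box with 12 CPUs isolated (housekeeping mask 0xF), callers
would see 4 instead of 16.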



[PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-09-28 Thread Nitesh Narayan Lal
Extend the nohz_full feature set to include isolation from managed IRQs.
This is required specifically for setups that only use nohz_full and still
require isolation to maintain lower latency for the listed CPUs.

Suggested-by: Frederic Weisbecker 
Signed-off-by: Nitesh Narayan Lal 
---
 kernel/sched/isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
unsigned int flags;
 
flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-   HK_FLAG_MISC | HK_FLAG_KTHREAD;
+   HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
 
return housekeeping_setup(str, flags);
 }
-- 
2.18.2



[PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs

2020-09-28 Thread Nitesh Narayan Lal
If we have isolated CPUs designated to perform real-time tasks, then to
keep the latency overhead to a minimum for the real-time CPUs, IRQ vectors
are moved from userspace to housekeeping CPUs. Creating MSIX vectors based
only on the online CPUs could lead to exhaustion of housekeeping CPU IRQ
vectors in such environments.

This patch prevents i40e from creating vectors based only on online CPUs,
by retrieving the number of online housekeeping CPUs that are designated
to perform managed IRQ jobs.

Signed-off-by: Nitesh Narayan Lal 
Reviewed-by: Marcelo Tosatti 
Acked-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2e433fdbf2c3..370b581cd48c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Local includes */
@@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 * will use any remaining vectors to reach as close as we can to the
 * number of online CPUs.
 */
-   cpus = num_online_cpus();
+   cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
 
-- 
2.18.2



[PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs

2020-09-28 Thread Nitesh Narayan Lal
gaas).

Changes from v2[3]:
==
- Renamed hk_num_online_cpus() with housekeeping_num_online_cpus() to keep
  the naming convention consistent (based on a suggestion from Peter
  Zijlstra and Frederic Weisbecker).
- Added an argument "enum hk_flags" to the housekeeping_num_online_cpus() API
  to make it more usable in different use-cases (based on a suggestion from 
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comment based on Bjorn Helgaas's
  suggestion.

Changes from v1[4]:
==
Patch1:                                                                       
- Replaced num_houskeeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.           
- Since the hk_num_online_cpus() is self-explanatory, got rid of             
  the comment that was added previously.                                     
Patch2:                                                                       
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.            
  
Patch4 (PCI):                                                                 
- For cases where the min_vecs exceeds the online housekeeping CPUs, instead
  of skipping modification to max_vecs, started restricting it based on the
  min_vecs. This is based on a suggestion from Marcelo Tosatti.                 
                                                   


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200925182654.224004-1-nit...@redhat.com/
[3] https://lore.kernel.org/lkml/20200923181126.223766-1-nit...@redhat.com/
[4] https://lore.kernel.org/lkml/20200909150818.313699-1-nit...@redhat.com/


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 drivers/pci/msi.c   | 18 ++
 include/linux/sched/isolation.h |  9 +
 kernel/sched/isolation.c|  2 +-
 4 files changed, 30 insertions(+), 2 deletions(-)

-- 




Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal

On 9/25/20 5:38 PM, Nitesh Narayan Lal wrote:
> On 9/25/20 4:23 PM, Bjorn Helgaas wrote:

[...]

>>> +   /*
>>> +* If we have isolated CPUs for use by real-time tasks, to keep the
>>> +* latency overhead to a minimum, device-specific IRQ vectors are moved
>>> +* to the housekeeping CPUs from the userspace by changing their
>>> +* affinity mask. Limit the vector usage to keep housekeeping CPUs from
>>> +* running out of IRQ vectors.
>>> +*/
>>> +   if (hk_cpus < num_online_cpus()) {
>>> +   if (hk_cpus < min_vecs)
>>> +   max_vecs = min_vecs;
>>> +   else if (hk_cpus < max_vecs)
>>> +   max_vecs = hk_cpus;
>>> +   }
>> It seems like you'd want to do this inside
>> pci_alloc_irq_vectors_affinity() since that's an exported interface,
>> and drivers that use it will bypass the limiting you're doing here.
> Good point, few drivers directly use this.
> I took a quick look and it seems I may also have to take the pre and the
> post vectors into consideration.
>

It seems my initial interpretation was incorrect: reserved vectors
(pre + post) should always be less than min_vecs.
So, FWIU, we should be fine limiting max_vecs.

I can make this change and do a repost.

Do you have any other concerns with this patch or with any of the other
patches?

[...]

-- 
Nitesh





Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal

On 9/25/20 4:23 PM, Bjorn Helgaas wrote:
> On Fri, Sep 25, 2020 at 02:26:54PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/pci.h | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 835530605c0d..a7b10240b778 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -38,6 +38,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include 
>> @@ -1797,6 +1798,22 @@ static inline int
>>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>>unsigned int max_vecs, unsigned int flags)
>>  {
>> +unsigned int hk_cpus;
>> +
>> +hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> Add blank line here before the block comment.
>
>> +/*
>> + * If we have isolated CPUs for use by real-time tasks, to keep the
>> + * latency overhead to a minimum, device-specific IRQ vectors are moved
>> + * to the housekeeping CPUs from the userspace by changing their
>> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> + * running out of IRQ vectors.
>> + */
>> +if (hk_cpus < num_online_cpus()) {
>> +if (hk_cpus < min_vecs)
>> +max_vecs = min_vecs;
>> +else if (hk_cpus < max_vecs)
>> +max_vecs = hk_cpus;
>> +}
> It seems like you'd want to do this inside
> pci_alloc_irq_vectors_affinity() since that's an exported interface,
> and drivers that use it will bypass the limiting you're doing here.

Good point, a few drivers use this directly.
I took a quick look, and it seems I may also have to take the pre and
post vectors into consideration.

>>  return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
>>NULL);
>>  }
>> -- 
>> 2.18.2
>>
-- 
Nitesh





[PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal
If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/pci.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..a7b10240b778 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -1797,6 +1798,22 @@ static inline int
 pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
  unsigned int max_vecs, unsigned int flags)
 {
+   unsigned int hk_cpus;
+
+   hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+   /*
+* If we have isolated CPUs for use by real-time tasks, to keep the
+* latency overhead to a minimum, device-specific IRQ vectors are moved
+* to the housekeeping CPUs from the userspace by changing their
+* affinity mask. Limit the vector usage to keep housekeeping CPUs from
+* running out of IRQ vectors.
+*/
+   if (hk_cpus < num_online_cpus()) {
+   if (hk_cpus < min_vecs)
+   max_vecs = min_vecs;
+   else if (hk_cpus < max_vecs)
+   max_vecs = hk_cpus;
+   }
return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
  NULL);
 }
-- 
2.18.2



[PATCH v3 1/4] sched/isolation: API to get number of housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal
Introduce a new API housekeeping_num_online_cpus(), that can be used to
retrieve the number of online housekeeping CPUs based on the housekeeping
flag passed by the caller.

Some of the consumers for this API are the device drivers that were
previously relying only on num_online_cpus() to determine the number of
MSIX vectors to create. In real-time environments, to minimize interruptions
to isolated CPUs, all device-specific IRQ vectors are often moved to the
housekeeping CPUs; having excess vectors could cause a housekeeping CPU to
run out of IRQ vectors.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/sched/isolation.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..e021b1846c1d 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -57,4 +57,13 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags 
flags)
return true;
 }
 
+static inline unsigned int housekeeping_num_online_cpus(enum hk_flags flags)
+{
+#ifdef CONFIG_CPU_ISOLATION
+   if (static_branch_unlikely(&housekeeping_overridden))
+   return cpumask_weight(housekeeping_cpumask(flags));
+#endif
+   return num_online_cpus();
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
-- 
2.18.2



[PATCH v3 0/4] isolation: limit msix vectors to housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal
ekeeping_num_online_cpus() API
  to make it more usable in different use-cases (based on a suggestion from 
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comment based on Bjorn Helgaas's
  suggestion.

Changes from v1[3]:
==
Patch1:                                                                       
- Replaced num_houskeeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.           
- Since the hk_num_online_cpus() is self-explanatory, got rid of             
  the comment that was added previously.                                     
Patch2:                                                                       
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.            
  
Patch4 (PCI):                                                                 
- For cases where the min_vecs exceeds the online housekeeping CPUs, instead
  of skipping modification to max_vecs, started restricting it based on the
  min_vecs. This is based on a suggestion from Marcelo Tosatti.                 
                                                   


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200923181126.223766-1-nit...@redhat.com/
[3] https://lore.kernel.org/lkml/20200909150818.313699-1-nit...@redhat.com/


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 include/linux/pci.h                         | 17 +++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 29 insertions(+), 2 deletions(-)

-- 




[PATCH v3 3/4] i40e: Limit msix vectors to housekeeping CPUs

2020-09-25 Thread Nitesh Narayan Lal
If we have isolated CPUs designated to perform real-time tasks, IRQ vectors
are moved from those CPUs to the housekeeping CPUs from userspace to keep
the latency overhead on the real-time CPUs to a minimum. In such
environments, creating MSIX vectors based only on the online CPUs could
lead to exhaustion of housekeeping CPU IRQ vectors.

This patch prevents i40e from creating vectors based only on the online
CPUs by instead retrieving the online housekeeping CPUs that are designated
to perform managed IRQ jobs.

Signed-off-by: Nitesh Narayan Lal 
Reviewed-by: Marcelo Tosatti 
Acked-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2e433fdbf2c3..370b581cd48c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 
 /* Local includes */
@@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 * will use any remaining vectors to reach as close as we can to the
 * number of online CPUs.
 */
-   cpus = num_online_cpus();
+   cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
 
-- 
2.18.2



[PATCH v3 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-09-25 Thread Nitesh Narayan Lal
Extend the nohz_full feature set to include isolation from managed IRQs.
This is required specifically for setups that only use nohz_full and still
require isolation to maintain lower latency for the listed CPUs.

Suggested-by: Frederic Weisbecker 
Signed-off-by: Nitesh Narayan Lal 
---
 kernel/sched/isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
unsigned int flags;
 
flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-   HK_FLAG_MISC | HK_FLAG_KTHREAD;
+   HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
 
return housekeeping_setup(str, flags);
 }
-- 
2.18.2



Re: [PATCH v2 4/4] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 6:59 PM, Bjorn Helgaas wrote:
> On Thu, Sep 24, 2020 at 05:39:07PM -0400, Nitesh Narayan Lal wrote:
>> On 9/24/20 4:45 PM, Bjorn Helgaas wrote:
>>> Possible subject:
>>>
>>>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>> Will switch to this.
>>
>>> On Wed, Sep 23, 2020 at 02:11:26PM -0400, Nitesh Narayan Lal wrote:
>>>> This patch limits the pci_alloc_irq_vectors, max_vecs argument that is
>>>> passed on by the caller based on the housekeeping online CPUs (that are
>>>> meant to perform managed IRQ jobs).
>>>>
>>>> A minimum of the max_vecs passed and housekeeping online CPUs is derived
>>>> to ensure that we don't create excess vectors as that can be problematic
>>>> specifically in an RT environment. In cases where the min_vecs exceeds the
>>>> housekeeping online CPUs, max vecs is restricted based on the min_vecs
>>>> instead. The proposed change is required because for an RT environment
>>>> unwanted IRQs are moved to the housekeeping CPUs from isolated CPUs to
>>>> keep the latency overhead to a minimum. If the number of housekeeping CPUs
>>>> is significantly lower than that of the isolated CPUs we can run into
>>>> failures while moving these IRQs to housekeeping CPUs due to per CPU
>>>> vector limit.
>>> Does this capture enough of the log?
>>>
>>>   If we have isolated CPUs dedicated for use by real-time tasks, we
>>>   try to move IRQs to housekeeping CPUs to reduce overhead on the
>>>   isolated CPUs.
>> How about:
>> "
>> If we have isolated CPUs or CPUs running in nohz_full mode for the purpose
>> of real-time, we try to move IRQs to housekeeping CPUs to reduce latency
>> overhead on these real-time CPUs.
>> "
>>
>> What do you think?
> It's OK, but from the PCI core perspective, "nohz_full mode" doesn't
> really mean anything.  I think it's a detail that should be inside the
> "housekeeping CPU" abstraction.

I get your point, in that case I will probably stick to your original
suggestion just replacing the term "overhead" with "latency overhead".

>
>>>   If we allocate too many IRQ vectors, moving them all to housekeeping
>>>   CPUs may exceed per-CPU vector limits.
>>>
>>>   When we have isolated CPUs, limit the number of vectors allocated by
>>>   pci_alloc_irq_vectors() to the minimum number required by the
>>>   driver, or to one per housekeeping CPU if that is larger
>> I think this is good, I can adopt this.
>>
>>> .
>>>
>>>> Signed-off-by: Nitesh Narayan Lal 
>>>> ---
>>>>  include/linux/pci.h | 15 +++
>>>>  1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 835530605c0d..cf9ca9410213 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -38,6 +38,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include <linux/sched/isolation.h>
>>>>  #include 
>>>>  
>>>>  #include 
>>>> @@ -1797,6 +1798,20 @@ static inline int
>>>>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>>>>  unsigned int max_vecs, unsigned int flags)
>>>>  {
>>>> +  unsigned int hk_cpus = hk_num_online_cpus();
>>>> +
>>>> +  /*
>>>> +   * For a real-time environment, try to be conservative and at max only
>>>> +   * ask for the same number of vectors as there are housekeeping online
>>>> +   * CPUs. In case, the min_vecs requested exceeds the housekeeping
>>>> +   * online CPUs, restrict the max_vecs based on the min_vecs instead.
>>>> +   */
>>>> +  if (hk_cpus != num_online_cpus()) {
>>>> +  if (min_vecs > hk_cpus)
>>>> +  max_vecs = min_vecs;
>>>> +  else
>>>> +  max_vecs = min_t(int, max_vecs, hk_cpus);
>>>> +  }
>>> Is the below basically the same?
>>>
>>> /*
>>>  * If we have isolated CPUs for use by real-time tasks,
>>>  * minimize overhead on those CPUs by moving IRQs to the
>>>  * remaining "housekeeping" CPUs.  Limit vector usage to keep
>>>  * housekeeping CPUs from running out of IRQ vectors.
>>>  */
>>

Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 4:47 PM, Bjorn Helgaas wrote:
> On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote:
>> Introduce a new API hk_num_online_cpus(), that can be used to
>> retrieve the number of online housekeeping CPUs that are meant to handle
>> managed IRQ jobs.
>>
>> This API is introduced for the drivers that were previously relying only
>> on num_online_cpus() to determine the number of MSIX vectors to create.
>> In an RT environment with large isolated but fewer housekeeping CPUs this
>> was leading to a situation where an attempt to move all of the vectors
>> corresponding to isolated CPUs to housekeeping CPUs were failing due to
>> per CPU vector limit.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/sched/isolation.h | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/sched/isolation.h 
>> b/include/linux/sched/isolation.h
>> index cc9f393e2a70..2e96b626e02e 100644
>> --- a/include/linux/sched/isolation.h
>> +++ b/include/linux/sched/isolation.h
>> @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum 
>> hk_flags flags)
>>  return true;
>>  }
>>  
>> +static inline unsigned int hk_num_online_cpus(void)
>> +{
>> +#ifdef CONFIG_CPU_ISOLATION
>> +const struct cpumask *hk_mask;
>> +
>> +if (static_branch_unlikely(&housekeeping_overridden)) {
>> +hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
>> +return cpumask_weight(hk_mask);
>> +}
>> +#endif
>> +return cpumask_weight(cpu_online_mask);
> Just curious: why is this not
>
>   #ifdef CONFIG_CPU_ISOLATION
>   ...
>   #endif
> return num_online_cpus();

I think doing an atomic read is better than a bitmap operation.
Thanks for pointing this out.

>
>> +}
>> +
>>  #endif /* _LINUX_SCHED_ISOLATION_H */
>> -- 
>> 2.18.2
>>
-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 4/4] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 4:45 PM, Bjorn Helgaas wrote:
> Possible subject:
>
>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

Will switch to this.

>
> On Wed, Sep 23, 2020 at 02:11:26PM -0400, Nitesh Narayan Lal wrote:
>> This patch limits the pci_alloc_irq_vectors, max_vecs argument that is
>> passed on by the caller based on the housekeeping online CPUs (that are
>> meant to perform managed IRQ jobs).
>>
>> A minimum of the max_vecs passed and housekeeping online CPUs is derived
>> to ensure that we don't create excess vectors as that can be problematic
>> specifically in an RT environment. In cases where the min_vecs exceeds the
>> housekeeping online CPUs, max vecs is restricted based on the min_vecs
>> instead. The proposed change is required because for an RT environment
>> unwanted IRQs are moved to the housekeeping CPUs from isolated CPUs to
>> keep the latency overhead to a minimum. If the number of housekeeping CPUs
>> is significantly lower than that of the isolated CPUs we can run into
>> failures while moving these IRQs to housekeeping CPUs due to per CPU
>> vector limit.
> Does this capture enough of the log?
>
>   If we have isolated CPUs dedicated for use by real-time tasks, we
>   try to move IRQs to housekeeping CPUs to reduce overhead on the
>   isolated CPUs.

How about:
"
If we have isolated CPUs or CPUs running in nohz_full mode for the purpose
of real-time, we try to move IRQs to housekeeping CPUs to reduce latency
overhead on these real-time CPUs.
"

What do you think?

>
>   If we allocate too many IRQ vectors, moving them all to housekeeping
>   CPUs may exceed per-CPU vector limits.
>
>   When we have isolated CPUs, limit the number of vectors allocated by
>   pci_alloc_irq_vectors() to the minimum number required by the
>   driver, or to one per housekeeping CPU if that is larger

I think this is good, I can adopt this.

> .
>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/pci.h | 15 +++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 835530605c0d..cf9ca9410213 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -38,6 +38,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <linux/sched/isolation.h>
>>  #include 
>>  
>>  #include 
>> @@ -1797,6 +1798,20 @@ static inline int
>>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>>unsigned int max_vecs, unsigned int flags)
>>  {
>> +unsigned int hk_cpus = hk_num_online_cpus();
>> +
>> +/*
>> + * For a real-time environment, try to be conservative and at max only
>> + * ask for the same number of vectors as there are housekeeping online
>> + * CPUs. In case, the min_vecs requested exceeds the housekeeping
>> + * online CPUs, restrict the max_vecs based on the min_vecs instead.
>> + */
>> +if (hk_cpus != num_online_cpus()) {
>> +if (min_vecs > hk_cpus)
>> +max_vecs = min_vecs;
>> +else
>> +max_vecs = min_t(int, max_vecs, hk_cpus);
>> +}
> Is the below basically the same?
>
>   /*
>* If we have isolated CPUs for use by real-time tasks,
>* minimize overhead on those CPUs by moving IRQs to the
>* remaining "housekeeping" CPUs.  Limit vector usage to keep
>* housekeeping CPUs from running out of IRQ vectors.
>*/

How about the following as a comment:

"
If we have isolated CPUs or CPUs running in nohz_full mode for real-time,
latency overhead is minimized on those CPUs by moving the IRQ vectors to
the housekeeping CPUs. Limit the number of vectors to keep housekeeping
CPUs from running out of IRQ vectors.

"

>   if (housekeeping_cpus < num_online_cpus()) {
>   if (housekeeping_cpus < min_vecs)
>   max_vecs = min_vecs;
>   else if (housekeeping_cpus < max_vecs)
>   max_vecs = housekeeping_cpus;
>   }

The only reason I went with hk_cpus instead of housekeeping_cpus is because
at other places in the kernel I found a similar naming convention (eg.
hk_mask, hk_flags etc.).
But if housekeeping_cpus makes things more clear, I can switch to that
instead.

Although after Frederic and Peter's suggestion the previous call will change
to something like:

"
housekeeping_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
"

Which should still fall within the 80-characters-per-line limit.

>
> My comment isn't quite right because thi

Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 8:46 AM, Peter Zijlstra wrote:
>
> FWIW, cross-posting to moderated lists is annoying. I don't know why we
> allow them in MAINTAINERS :-(

Yeah, it sends out an acknowledgment for every email.
I had to include it because sending the patches to it apparently allows them
to get tested by Intel folks.

>
-- 
Nitesh





Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 8:11 AM, Frederic Weisbecker wrote:
> On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote:
>> Introduce a new API hk_num_online_cpus(), that can be used to
>> retrieve the number of online housekeeping CPUs that are meant to handle
>> managed IRQ jobs.
>>
>> This API is introduced for the drivers that were previously relying only
>> on num_online_cpus() to determine the number of MSIX vectors to create.
>> In an RT environment with large isolated but fewer housekeeping CPUs this
>> was leading to a situation where an attempt to move all of the vectors
>> corresponding to isolated CPUs to housekeeping CPUs were failing due to
>> per CPU vector limit.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/sched/isolation.h | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/sched/isolation.h 
>> b/include/linux/sched/isolation.h
>> index cc9f393e2a70..2e96b626e02e 100644
>> --- a/include/linux/sched/isolation.h
>> +++ b/include/linux/sched/isolation.h
>> @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum 
>> hk_flags flags)
>>  return true;
>>  }
>>  
>> +static inline unsigned int hk_num_online_cpus(void)
>> +{
>> +#ifdef CONFIG_CPU_ISOLATION
>> +const struct cpumask *hk_mask;
>> +
>> +if (static_branch_unlikely(&housekeeping_overridden)) {
>> +hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
> HK_FLAG_MANAGED_IRQ should be pass as an argument to the function:
>
> housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ) because it's
> completely arbitrary otherwise.


Yeap that is more sensible, I will do that.
Do you have any other concerns/suggestions on any other patch?

>
>> +return cpumask_weight(hk_mask);
>> +}
>> +#endif
>> +return cpumask_weight(cpu_online_mask);
>> +}
>> +
>>  #endif /* _LINUX_SCHED_ISOLATION_H */
>> -- 
>> 2.18.2
>>
-- 
Thanks
Nitesh





Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Nitesh Narayan Lal

On 9/24/20 8:09 AM, Frederic Weisbecker wrote:
> On Thu, Sep 24, 2020 at 10:40:29AM +0200, pet...@infradead.org wrote:
>> On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote:
>>> Introduce a new API hk_num_online_cpus(), that can be used to
>>> retrieve the number of online housekeeping CPUs that are meant to handle
>>> managed IRQ jobs.
>>>
>>> This API is introduced for the drivers that were previously relying only
>>> on num_online_cpus() to determine the number of MSIX vectors to create.
>>> In an RT environment with large isolated but fewer housekeeping CPUs this
>>> was leading to a situation where an attempt to move all of the vectors
>>> corresponding to isolated CPUs to housekeeping CPUs were failing due to
>>> per CPU vector limit.
>>>
>>> Signed-off-by: Nitesh Narayan Lal 
>>> ---
>>>  include/linux/sched/isolation.h | 13 +
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/include/linux/sched/isolation.h 
>>> b/include/linux/sched/isolation.h
>>> index cc9f393e2a70..2e96b626e02e 100644
>>> --- a/include/linux/sched/isolation.h
>>> +++ b/include/linux/sched/isolation.h
>>> @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum 
>>> hk_flags flags)
>>> return true;
>>>  }
>>>  
>>> +static inline unsigned int hk_num_online_cpus(void)
>> This breaks with the established naming of that header.
> I guess we can make it housekeeping_num_online_cpus() ?

Right, I can do that.

Thanks

>
-- 
Nitesh





[PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-23 Thread Nitesh Narayan Lal
Introduce a new API hk_num_online_cpus(), that can be used to
retrieve the number of online housekeeping CPUs that are meant to handle
managed IRQ jobs.

This API is introduced for the drivers that were previously relying only
on num_online_cpus() to determine the number of MSIX vectors to create.
In an RT environment with many isolated CPUs but few housekeeping CPUs,
this was leading to a situation where an attempt to move all of the vectors
corresponding to isolated CPUs to housekeeping CPUs was failing due to the
per-CPU vector limit.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/sched/isolation.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..2e96b626e02e 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags 
flags)
return true;
 }
 
+static inline unsigned int hk_num_online_cpus(void)
+{
+#ifdef CONFIG_CPU_ISOLATION
+   const struct cpumask *hk_mask;
+
+   if (static_branch_unlikely(&housekeeping_overridden)) {
+   hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
+   return cpumask_weight(hk_mask);
+   }
+#endif
+   return cpumask_weight(cpu_online_mask);
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
-- 
2.18.2



[PATCH v2 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-09-23 Thread Nitesh Narayan Lal
Extend the nohz_full feature set to include isolation from managed IRQs.
This is required specifically for setups that only use nohz_full and still
require isolation to maintain lower latency for the listed CPUs.

Having this change ensures that the kernel functions that use
HK_FLAG_MANAGED_IRQ to derive the cpumask for pinning various jobs/IRQs do
not enqueue anything on the CPUs listed under nohz_full.

Suggested-by: Frederic Weisbecker 
Signed-off-by: Nitesh Narayan Lal 
---
 kernel/sched/isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
unsigned int flags;
 
flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-   HK_FLAG_MISC | HK_FLAG_KTHREAD;
+   HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
 
return housekeeping_setup(str, flags);
 }
-- 
2.18.2



[PATCH v2 0/4] isolation: limit msix vectors based on housekeeping CPUs

2020-09-23 Thread Nitesh Narayan Lal
f-explanatory, got rid of             
  the comment that was added previously.                                     
Patch2:                                                                       
- Added a new patch that is meant to enable managed IRQ isolation for nohz_full
  CPUs. This is based on Frederic Weisbecker's suggestion.              
Patch4 (PCI):                                                                 
- For cases where the min_vecs exceeds the online housekeeping CPUs, instead
  of skipping modification to max_vecs, started restricting it based on the
  min_vecs. This is based on a suggestion from Marcelo Tosatti.                 
                                                   

[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200909150818.313699-1-nit...@redhat.com/

Nitesh Narayan Lal (4):
  sched/isolation: API to get housekeeping online CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: limit msix vectors based on housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 include/linux/pci.h                         | 15 +++++++++++++++
 include/linux/sched/isolation.h             | 13 +++++++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 31 insertions(+), 2 deletions(-)

-- 




[PATCH v2 4/4] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-23 Thread Nitesh Narayan Lal
This patch limits the pci_alloc_irq_vectors() max_vecs argument that is
passed in by the caller, based on the housekeeping online CPUs (that are
meant to perform managed IRQ jobs).

A minimum of the max_vecs passed and housekeeping online CPUs is derived
to ensure that we don't create excess vectors as that can be problematic
specifically in an RT environment. In cases where the min_vecs exceeds the
housekeeping online CPUs, max vecs is restricted based on the min_vecs
instead. The proposed change is required because for an RT environment
unwanted IRQs are moved to the housekeeping CPUs from isolated CPUs to
keep the latency overhead to a minimum. If the number of housekeeping CPUs
is significantly lower than that of the isolated CPUs we can run into
failures while moving these IRQs to housekeeping CPUs due to per CPU
vector limit.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/pci.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..cf9ca9410213 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 
 #include 
@@ -1797,6 +1798,20 @@ static inline int
 pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
  unsigned int max_vecs, unsigned int flags)
 {
+   unsigned int hk_cpus = hk_num_online_cpus();
+
+   /*
+* For a real-time environment, try to be conservative and at max only
+* ask for the same number of vectors as there are housekeeping online
+* CPUs. In case, the min_vecs requested exceeds the housekeeping
+* online CPUs, restrict the max_vecs based on the min_vecs instead.
+*/
+   if (hk_cpus != num_online_cpus()) {
+   if (min_vecs > hk_cpus)
+   max_vecs = min_vecs;
+   else
+   max_vecs = min_t(int, max_vecs, hk_cpus);
+   }
return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
  NULL);
 }
-- 
2.18.2



[PATCH v2 3/4] i40e: limit msix vectors based on housekeeping CPUs

2020-09-23 Thread Nitesh Narayan Lal
In a realtime environment, it is essential to isolate unwanted IRQs from
isolated CPUs to prevent latency overheads. Creating MSIX vectors only
based on the online CPUs could lead to a potential issue on an RT setup
that has several isolated CPUs but very few housekeeping CPUs. This is
because in these kinds of setups an attempt to move the IRQs to the limited
housekeeping CPUs from isolated CPUs might fail due to the per CPU vector
limit. This could eventually result in latency spikes because of the IRQs
that we fail to move from isolated CPUs.

This patch prevents i40e from creating vectors based only on the online
CPUs by using hk_num_online_cpus() instead.

Signed-off-by: Nitesh Narayan Lal 
Reviewed-by: Marcelo Tosatti 
Acked-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2e433fdbf2c3..fcb6fa3707e0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 
 /* Local includes */
@@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 * will use any remaining vectors to reach as close as we can to the
 * number of online CPUs.
 */
-   cpus = num_online_cpus();
+   cpus = hk_num_online_cpus();
pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
 
-- 
2.18.2



Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/22/20 5:26 PM, Andrew Lunn wrote:
>> Subject: Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of 
>> hosekeeping CPUs
> Hosekeeping? Are these CPUs out gardening in the weeds?

Bjorn has already highlighted the typo, so I will be fixing it in the next
version.
Do you find the commit message and body of this patch unclear?

>
>Andrew
>
-- 
Nitesh





Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/22/20 4:58 PM, Frederic Weisbecker wrote:
> On Tue, Sep 22, 2020 at 09:50:55AM -0400, Nitesh Narayan Lal wrote:
>> On 9/22/20 6:08 AM, Frederic Weisbecker wrote:
>> TBH I don't have a very strong case here at the moment.
>> But still, IMHO, this will force the user to have both managed irqs and
>> nohz_full in their environments to avoid these kinds of issues. Is that how
>> we would like to proceed?
> Yep that sounds good to me. I never know how much we want to split each and 
> any
> of the isolation features but I'd rather stay cautious to separate 
> HK_FLAG_TICK
> from the rest, just in case running in nohz_full mode ever becomes interesting
> alone for performance and not just latency/isolation.

Fair point.

>
> But look what you can do as well:
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5a6ea03f9882..9df9598a9e39 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>   unsigned int flags;
>  
>   flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> - HK_FLAG_MISC | HK_FLAG_KTHREAD;
> + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>  
>   return housekeeping_setup(str, flags);
>  }
>
>
> "nohz_full=" has historically gathered most wanted isolation features. It can
> as well isolate managed irqs.

Nice, yeap this will work.

>
>
>>> And then can we rename it to housekeeping_num_online()?
>> It could be just me, but does something like hk_num_online_cpus() makes more
>> sense here?
> Sure, that works as well.

Thanks a lot for all the help.

>
> Thanks.
>
-- 
Nitesh





Re: [RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/22/20 4:44 PM, Frederic Weisbecker wrote:
> On Tue, Sep 22, 2020 at 09:34:02AM -0400, Nitesh Narayan Lal wrote:
>> On 9/22/20 5:54 AM, Frederic Weisbecker wrote:
>>> But I don't also want to push toward a complicated solution to handle CPU 
>>> hotplug
>>> if there is no actual problem to solve there.
>> Sure, even I am not particularly sure about the hotplug scenarios.
> Surely when isolation is something that will be configurable through
> cpuset, it will become interesting. For now it's probably not worth
> adding hundreds lines of code if nobody steps into such issue yet.
>
> What is probably more worthy for now is some piece of code to consolidate
> the allocation of those MSI vectors on top of the number of housekeeping
> CPUs.

+1

>
> Thanks.
>
-- 
Thanks
Nitesh





Re: [RFC][Patch v1 3/3] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/10/20 3:31 PM, Nitesh Narayan Lal wrote:
> On 9/10/20 3:22 PM, Marcelo Tosatti wrote:
>> On Wed, Sep 09, 2020 at 11:08:18AM -0400, Nitesh Narayan Lal wrote:
>>> This patch limits the pci_alloc_irq_vectors max vectors that is passed on
>>> by the caller based on the available housekeeping CPUs by only using the
>>> minimum of the two.
>>>
>>> A minimum of the max_vecs passed and available housekeeping CPUs is
>>> derived to ensure that we don't create excess vectors which can be
>>> problematic specifically in an RT environment. This is because for an RT
>>> environment unwanted IRQs are moved to the housekeeping CPUs from
>>> isolated CPUs to keep the latency overhead to a minimum. If the number of
>>> housekeeping CPUs is significantly lower than that of the isolated CPUs
>>> we can run into failures while moving these IRQs to housekeeping due to
>>> per CPU vector limit.
>>>
>>> Signed-off-by: Nitesh Narayan Lal 
>>> ---
>>>  include/linux/pci.h | 16 
>>>  1 file changed, 16 insertions(+)
>>>
>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>> index 835530605c0d..750ba927d963 100644
>>> --- a/include/linux/pci.h
>>> +++ b/include/linux/pci.h
>>> @@ -38,6 +38,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include <linux/sched/isolation.h>
>>>  #include 
>>>  
>>>  #include 
>>> @@ -1797,6 +1798,21 @@ static inline int
>>>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>>>   unsigned int max_vecs, unsigned int flags)
>>>  {
>>> +   unsigned int num_housekeeping = num_housekeeping_cpus();
>>> +   unsigned int num_online = num_online_cpus();
>>> +
>>> +   /*
>>> +* Try to be conservative and at max only ask for the same number of
>>> +* vectors as there are housekeeping CPUs. However, skip any
>>> +* modification to the of max vectors in two conditions:
>>> +* 1. If the min_vecs requested are higher than that of the
>>> +*housekeeping CPUs as we don't want to prevent the initialization
>>> +*of a device.
>>> +* 2. If there are no isolated CPUs as in this case the driver should
>>> +*already have taken online CPUs into consideration.
>>> +*/
>>> +   if (min_vecs < num_housekeeping && num_housekeeping != num_online)
>>> +   max_vecs = min_t(int, max_vecs, num_housekeeping);
>>> return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
>>>   NULL);
>>>  }
>> If min_vecs > num_housekeeping, for example:
>>
>> /* PCI MSI/MSIx support */
>> #define XGBE_MSI_BASE_COUNT 4
>> #define XGBE_MSI_MIN_COUNT  (XGBE_MSI_BASE_COUNT + 1)
>>
>> Then the protection fails.
> Right, I was ignoring that case.
>
>> How about reducing max_vecs down to min_vecs, if min_vecs >
>> num_housekeeping ?
> Yes, I think this makes sense.
> I will wait a bit to see if anyone else has any other comment and will post
> the next version then.
>

Are there any other comments/concerns on this patch that I need to address in
the next posting?

-- 
Nitesh





Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/22/20 6:08 AM, Frederic Weisbecker wrote:
> On Mon, Sep 21, 2020 at 11:16:51PM -0400, Nitesh Narayan Lal wrote:
>> On 9/21/20 7:40 PM, Frederic Weisbecker wrote:
>>> On Wed, Sep 09, 2020 at 11:08:16AM -0400, Nitesh Narayan Lal wrote:
>>>> +/*
>>>> + * num_housekeeping_cpus() - Read the number of housekeeping CPUs.
>>>> + *
>>>> + * This function returns the number of available housekeeping CPUs
>>>> + * based on __num_housekeeping_cpus which is of type atomic_t
>>>> + * and is initialized at the time of the housekeeping setup.
>>>> + */
>>>> +unsigned int num_housekeeping_cpus(void)
>>>> +{
>>>> +  unsigned int cpus;
>>>> +
>>>> +  if (static_branch_unlikely(&housekeeping_overridden)) {
>>>> +  cpus = atomic_read(&__num_housekeeping_cpus);
>>>> +  /* We should always have at least one housekeeping CPU */
>>>> +  BUG_ON(!cpus);
>>>> +  return cpus;
>>>> +  }
>>>> +  return num_online_cpus();
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(num_housekeeping_cpus);
>>>> +
>>>>  int housekeeping_any_cpu(enum hk_flags flags)
>>>>  {
>>>>int cpu;
>>>> @@ -131,6 +153,7 @@ static int __init housekeeping_setup(char *str, enum 
>>>> hk_flags flags)
>>>>  
>>>>housekeeping_flags |= flags;
>>>>  
>>>> +  atomic_set(&__num_housekeeping_cpus, cpumask_weight(housekeeping_mask));
>>> So the problem here is that it takes the whole cpumask weight but you're 
>>> only
>>> interested in the housekeepers who take the managed irq duties I guess
>>> (HK_FLAG_MANAGED_IRQ ?).
>> IMHO we should also consider the cases where we only have nohz_full.
>> Otherwise, we may run into the same situation on those setups, do you agree?
> I guess it's up to the user to gather the tick and managed irq housekeeping
> together?

TBH I don't have a very strong case here at the moment.
But still, IMHO, this will force the user to have both managed irqs and
nohz_full in their environments to avoid these kinds of issues. Is that how
we would like to proceed?

The reason I want this clarity is that, going forward, I can base my thoughts
on this discussion for any RT-related work.

>
> Of course that makes the implementation more complicated. But if this is
> called only on drivers initialization for now, this could be just a function
> that does:
>
> cpumask_weight(cpu_online_mask | housekeeping_cpumask(HK_FLAG_MANAGED_IRQ))

Ack, this makes more sense.

>
> And then can we rename it to housekeeping_num_online()?

It could be just me, but does something like hk_num_online_cpus() make more
sense here?

>
> Thanks.
>
>>>>free_bootmem_cpumask_var(non_housekeeping_mask);
>>>>  
>>>>return 1;
>>>> -- 
>>>> 2.27.0
>>>>
>> -- 
>> Thanks
>> Nitesh
>>
>
>
-- 
Thanks
Nitesh





Re: [RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-22 Thread Nitesh Narayan Lal

On 9/22/20 5:54 AM, Frederic Weisbecker wrote:
> On Mon, Sep 21, 2020 at 11:08:20PM -0400, Nitesh Narayan Lal wrote:
>> On 9/21/20 6:58 PM, Frederic Weisbecker wrote:
>>> On Thu, Sep 17, 2020 at 11:23:59AM -0700, Jesse Brandeburg wrote:
>>>> Nitesh Narayan Lal wrote:
>>>>
>>>>> In a realtime environment, it is essential to isolate unwanted IRQs from
>>>>> isolated CPUs to prevent latency overheads. Creating MSIX vectors only
>>>>> based on the online CPUs could lead to a potential issue on an RT setup
>>>>> that has several isolated CPUs but very few housekeeping CPUs. This is
>>>>> because in these kinds of setups an attempt to move the IRQs to the
>>>>> limited housekeeping CPUs from isolated CPUs might fail due to the per
>>>>> CPU vector limit. This could eventually result in latency spikes because
>>>>> of the IRQ threads that we fail to move from isolated CPUs.
>>>>>
>>>>> This patch restricts i40e to adding vectors based only on the available
>>>>> housekeeping CPUs, by using num_housekeeping_cpus().
>>>>>
>>>>> Signed-off-by: Nitesh Narayan Lal 
>>>> The driver changes are straightforward, but this isn't the only driver
>>>> with this issue, right?  I'm sure ixgbe and ice both have this problem
>>>> too, you should fix them as well, at a minimum, and probably other
>>>> vendors drivers:
>>>>
>>>> $ rg -c --stats num_online_cpus drivers/net/ethernet
>>>> ...
>>>> 50 files contained matches
>>> Ouch, I was indeed surprised that these MSI vector allocations were done
>>> at the driver level and not at some $SUBSYSTEM level.
>>>
>>> The logic is already there in the driver so I wouldn't be opposed to this very 
>>> patch
>>> but would a shared infrastructure make sense for this? Something that would
>>> also handle hotplug operations?
>>>
>>> Does it possibly go even beyond networking drivers?
>> From a generic solution perspective, I think it makes sense to come up with a
>> shared infrastructure.
>> Something that can be consumed by all the drivers and maybe hotplug 
>> operations
>> as well (I will have to further explore the hotplug part).
> That would be great. I'm completely clueless about those MSI things and the
> actual needs of those drivers. Now it seems to me that if several CPUs become
> offline, or as is planned in the future, CPU isolation gets enabled/disabled
> through cpuset, then the vectors may need some reorganization.

+1

>
> But I don't also want to push toward a complicated solution to handle CPU 
> hotplug
> if there is no actual problem to solve there.

Sure, even I am not particularly sure about the hotplug scenarios.

>  So I let you guys judge.
>
>> However, there are RT workloads that are getting affected because of this
>> issue, so does it make sense to go ahead with this per-driver approach
>> for now?
> Yep that sounds good.

Thank you for confirming.

>
>> A generic solution will require a fair amount of testing and
>> understanding of different drivers. Having said that, I can definitely start
>> looking in that direction.
> Thanks a lot!
>
-- 
Nitesh





Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-21 Thread Nitesh Narayan Lal

On 9/21/20 7:40 PM, Frederic Weisbecker wrote:
> On Wed, Sep 09, 2020 at 11:08:16AM -0400, Nitesh Narayan Lal wrote:
>> +/*
>> + * num_housekeeping_cpus() - Read the number of housekeeping CPUs.
>> + *
>> + * This function returns the number of available housekeeping CPUs
>> + * based on __num_housekeeping_cpus which is of type atomic_t
>> + * and is initialized at the time of the housekeeping setup.
>> + */
>> +unsigned int num_housekeeping_cpus(void)
>> +{
>> +unsigned int cpus;
>> +
>> +if (static_branch_unlikely(&housekeeping_overridden)) {
>> +cpus = atomic_read(&__num_housekeeping_cpus);
>> +/* We should always have at least one housekeeping CPU */
>> +BUG_ON(!cpus);
>> +return cpus;
>> +}
>> +return num_online_cpus();
>> +}
>> +EXPORT_SYMBOL_GPL(num_housekeeping_cpus);
>> +
>>  int housekeeping_any_cpu(enum hk_flags flags)
>>  {
>>  int cpu;
>> @@ -131,6 +153,7 @@ static int __init housekeeping_setup(char *str, enum 
>> hk_flags flags)
>>  
>>  housekeeping_flags |= flags;
>>  
>> +atomic_set(&__num_housekeeping_cpus, cpumask_weight(housekeeping_mask));
> So the problem here is that it takes the whole cpumask weight but you're only
> interested in the housekeepers who take the managed irq duties I guess
> (HK_FLAG_MANAGED_IRQ ?).

IMHO we should also consider the cases where we only have nohz_full.
Otherwise, we may run into the same situation on those setups, do you agree?

>
>>  free_bootmem_cpumask_var(non_housekeeping_mask);
>>  
>>  return 1;
>> -- 
>> 2.27.0
>>
-- 
Thanks
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-21 Thread Nitesh Narayan Lal

On 9/21/20 6:58 PM, Frederic Weisbecker wrote:
> On Thu, Sep 17, 2020 at 11:23:59AM -0700, Jesse Brandeburg wrote:
>> Nitesh Narayan Lal wrote:
>>
>>> In a realtime environment, it is essential to isolate unwanted IRQs from
>>> isolated CPUs to prevent latency overheads. Creating MSIX vectors only
>>> based on the online CPUs could lead to a potential issue on an RT setup
>>> that has several isolated CPUs but very few housekeeping CPUs. This is
>>> because in these kinds of setups an attempt to move the IRQs to the
>>> limited housekeeping CPUs from isolated CPUs might fail due to the per
>>> CPU vector limit. This could eventually result in latency spikes because
>>> of the IRQ threads that we fail to move from isolated CPUs.
>>>
>>> This patch restricts i40e to adding vectors based only on the available
>>> housekeeping CPUs, by using num_housekeeping_cpus().
>>>
>>> Signed-off-by: Nitesh Narayan Lal 
>> The driver changes are straightforward, but this isn't the only driver
>> with this issue, right?  I'm sure ixgbe and ice both have this problem
>> too, you should fix them as well, at a minimum, and probably other
>> vendors drivers:
>>
>> $ rg -c --stats num_online_cpus drivers/net/ethernet
>> ...
>> 50 files contained matches
> Ouch, I was indeed surprised that these MSI vector allocations were done
> at the driver level and not at some $SUBSYSTEM level.
>
> The logic is already there in the driver so I wouldn't be opposed to this very 
> patch
> but would a shared infrastructure make sense for this? Something that would
> also handle hotplug operations?
>
> Does it possibly go even beyond networking drivers?

From a generic solution perspective, I think it makes sense to come up with a
shared infrastructure.
Something that can be consumed by all the drivers and maybe hotplug operations
as well (I will have to further explore the hotplug part).

However, there are RT workloads that are getting affected because of this
issue, so does it make sense to go ahead with this per-driver approach
for now?

A generic solution will require a fair amount of testing and
understanding of different drivers. Having said that, I can definitely start
looking in that direction.

Thanks for reviewing.

> Thanks.
>
>> for this patch i40e
>> Acked-by: Jesse Brandeburg 
-- 
Nitesh





Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-17 Thread Nitesh Narayan Lal

On 9/17/20 4:11 PM, Bjorn Helgaas wrote:
> [+cc Ingo, Peter, Juri, Vincent (scheduler maintainers)]
>
> s/hosekeeping/housekeeping/ (in subject)
>
> On Wed, Sep 09, 2020 at 11:08:16AM -0400, Nitesh Narayan Lal wrote:
>> Introduce a new API, num_housekeeping_cpus(), that can be used to retrieve
>> the number of housekeeping CPUs by reading an atomic variable
>> __num_housekeeping_cpus. This variable is set from housekeeping_setup().
>>
>> This API is introduced for the purpose of drivers that were previously
>> relying only on num_online_cpus() to determine the number of MSIX vectors
>> to create. In an RT environment with many isolated CPUs but fewer
>> housekeeping CPUs, this was leading to a situation where an attempt to
>> move all of the vectors corresponding to isolated CPUs to housekeeping
>> CPUs was failing due to per CPU vector limit.
> Totally kibitzing here, but AFAICT the concepts of "isolated CPU" and
> "housekeeping CPU" are not currently exposed to drivers, and it's not
> completely clear to me that they should be.
>
> We have carefully constructed notions of possible, present, online,
> active CPUs, and it seems like whatever we do here should be
> somehow integrated with those.

At one point I thought about tweaking num_online_cpus(), but then I quickly
moved away from that just because it is extensively used in the kernel and we
don't want to modify the behavior at all those places.

Thank you for including Peter and Vincent as well.
I would be happy to discuss/explore other options.

>
>> If there are no isolated CPUs specified then the API returns the number
>> of all online CPUs.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/sched/isolation.h |  7 +++
>>  kernel/sched/isolation.c| 23 +++
>>  2 files changed, 30 insertions(+)
>>
>> diff --git a/include/linux/sched/isolation.h 
>> b/include/linux/sched/isolation.h
>> index cc9f393e2a70..94c25d956d8a 100644
>> --- a/include/linux/sched/isolation.h
>> +++ b/include/linux/sched/isolation.h
>> @@ -25,6 +25,7 @@ extern bool housekeeping_enabled(enum hk_flags flags);
>>  extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags);
>>  extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags);
>>  extern void __init housekeeping_init(void);
>> +extern unsigned int num_housekeeping_cpus(void);
>>  
>>  #else
>>  
>> @@ -46,6 +47,12 @@ static inline bool housekeeping_enabled(enum hk_flags 
>> flags)
>>  static inline void housekeeping_affine(struct task_struct *t,
>> enum hk_flags flags) { }
>>  static inline void housekeeping_init(void) { }
>> +
>> +static unsigned int num_housekeeping_cpus(void)
>> +{
>> +return num_online_cpus();
>> +}
>> +
>>  #endif /* CONFIG_CPU_ISOLATION */
>>  
>>  static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 5a6ea03f9882..7024298390b7 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -13,6 +13,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
>>  EXPORT_SYMBOL_GPL(housekeeping_overridden);
>>  static cpumask_var_t housekeeping_mask;
>>  static unsigned int housekeeping_flags;
>> +static atomic_t __num_housekeeping_cpus __read_mostly;
>>  
>>  bool housekeeping_enabled(enum hk_flags flags)
>>  {
>> @@ -20,6 +21,27 @@ bool housekeeping_enabled(enum hk_flags flags)
>>  }
>>  EXPORT_SYMBOL_GPL(housekeeping_enabled);
>>  
>> +/*
>> + * num_housekeeping_cpus() - Read the number of housekeeping CPUs.
>> + *
>> + * This function returns the number of available housekeeping CPUs
>> + * based on __num_housekeeping_cpus which is of type atomic_t
>> + * and is initialized at the time of the housekeeping setup.
>> + */
>> +unsigned int num_housekeeping_cpus(void)
>> +{
>> +unsigned int cpus;
>> +
>> +if (static_branch_unlikely(&housekeeping_overridden)) {
>> +cpus = atomic_read(&__num_housekeeping_cpus);
>> +/* We should always have at least one housekeeping CPU */
>> +BUG_ON(!cpus);
>> +return cpus;
>> +}
>> +return num_online_cpus();
>> +}
>> +EXPORT_SYMBOL_GPL(num_housekeeping_cpus);
>> +
>>  int housekeeping_any_cpu(enum hk_flags flags)
>>  {
>>  int cpu;
>> @@ -131,6 +153,7 @@ static int __init housekeeping_setup(char *str, enum 
>> hk_flags flags)
>>  
>>  housekeeping_flags |= flags;
>>  
>> +atomic_set(&__num_housekeeping_cpus, cpumask_weight(housekeeping_mask));
>>  free_bootmem_cpumask_var(non_housekeeping_mask);
>>  
>>  return 1;
>> -- 
>> 2.27.0
>>
-- 
Nitesh





Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-17 Thread Nitesh Narayan Lal

On 9/17/20 2:18 PM, Jesse Brandeburg wrote:
> Nitesh Narayan Lal wrote:
>
>> Introduce a new API, num_housekeeping_cpus(), that can be used to retrieve
>> the number of housekeeping CPUs by reading an atomic variable
>> __num_housekeeping_cpus. This variable is set from housekeeping_setup().
>>
>> This API is introduced for the purpose of drivers that were previously
>> relying only on num_online_cpus() to determine the number of MSIX vectors
>> to create. In an RT environment with many isolated CPUs but fewer
>> housekeeping CPUs, this was leading to a situation where an attempt to
>> move all of the vectors corresponding to isolated CPUs to housekeeping
>> CPUs was failing due to per CPU vector limit.
>>
>> If there are no isolated CPUs specified then the API returns the number
>> of all online CPUs.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/sched/isolation.h |  7 +++
>>  kernel/sched/isolation.c| 23 +++
>>  2 files changed, 30 insertions(+)
> I'm not a scheduler expert, but a couple comments follow.
>
>> diff --git a/include/linux/sched/isolation.h 
>> b/include/linux/sched/isolation.h
>> index cc9f393e2a70..94c25d956d8a 100644
>> --- a/include/linux/sched/isolation.h
>> +++ b/include/linux/sched/isolation.h
>> @@ -25,6 +25,7 @@ extern bool housekeeping_enabled(enum hk_flags flags);
>>  extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags);
>>  extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags);
>>  extern void __init housekeeping_init(void);
>> +extern unsigned int num_housekeeping_cpus(void);
>>  
>>  #else
>>  
>> @@ -46,6 +47,12 @@ static inline bool housekeeping_enabled(enum hk_flags 
>> flags)
>>  static inline void housekeeping_affine(struct task_struct *t,
>> enum hk_flags flags) { }
>>  static inline void housekeeping_init(void) { }
>> +
>> +static unsigned int num_housekeeping_cpus(void)
>> +{
>> +return num_online_cpus();
>> +}
>> +
>>  #endif /* CONFIG_CPU_ISOLATION */
>>  
>>  static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 5a6ea03f9882..7024298390b7 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -13,6 +13,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
>>  EXPORT_SYMBOL_GPL(housekeeping_overridden);
>>  static cpumask_var_t housekeeping_mask;
>>  static unsigned int housekeeping_flags;
>> +static atomic_t __num_housekeeping_cpus __read_mostly;
>>  
>>  bool housekeeping_enabled(enum hk_flags flags)
>>  {
>> @@ -20,6 +21,27 @@ bool housekeeping_enabled(enum hk_flags flags)
>>  }
>>  EXPORT_SYMBOL_GPL(housekeeping_enabled);
>>  
>> +/*
> use correct kdoc style, and you get free documentation from your source
> (you're so close!)
>
> should be (note the first line and the function title line change to
> remove parens:
> /**
>  * num_housekeeping_cpus - Read the number of housekeeping CPUs.
>  *
>  * This function returns the number of available housekeeping CPUs
>  * based on __num_housekeeping_cpus which is of type atomic_t
>  * and is initialized at the time of the housekeeping setup.
>  */

My bad, I missed that.
Thanks for pointing it out.

>
>> + * num_housekeeping_cpus() - Read the number of housekeeping CPUs.
>> + *
>> + * This function returns the number of available housekeeping CPUs
>> + * based on __num_housekeeping_cpus which is of type atomic_t
>> + * and is initialized at the time of the housekeeping setup.
>> + */
>> +unsigned int num_housekeeping_cpus(void)
>> +{
>> +unsigned int cpus;
>> +
>> +if (static_branch_unlikely(&housekeeping_overridden)) {
>> +cpus = atomic_read(&__num_housekeeping_cpus);
>> +/* We should always have at least one housekeeping CPU */
>> +BUG_ON(!cpus);
> you need to crash the kernel because of this? maybe a WARN_ON? How did
> the global even get set to the bad value? It's going to blame the poor
> caller for this in the trace, but the caller likely had nothing to do
> with setting the value incorrectly!

Yes, ideally this should never be triggered, but if it somehow is, then we
have a bug that needs to be fixed. That's probably the only reason why I chose
BUG_ON.
But I am not entirely against the usage of WARN_ON either, because we get a
stack trace anyway.
I will see if anyone else has any other concerns on this patch and then I can
post the next version.

>
>> +return cpus;
>> +}
>> +return num_online_cpus();
>> +}
>> +EXPORT_SYMBOL_GPL(num_housekeeping_cpus);
-- 
Thanks
Nitesh





Re: [RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-17 Thread Nitesh Narayan Lal

On 9/17/20 2:23 PM, Jesse Brandeburg wrote:
> Nitesh Narayan Lal wrote:
>
>> In a realtime environment, it is essential to isolate unwanted IRQs from
>> isolated CPUs to prevent latency overheads. Creating MSIX vectors only
>> based on the online CPUs could lead to a potential issue on an RT setup
>> that has several isolated CPUs but very few housekeeping CPUs. This is
>> because in these kinds of setups an attempt to move the IRQs to the
>> limited housekeeping CPUs from isolated CPUs might fail due to the per
>> CPU vector limit. This could eventually result in latency spikes because
>> of the IRQ threads that we fail to move from isolated CPUs.
>>
>> This patch restricts i40e to adding vectors based only on the available
>> housekeeping CPUs, by using num_housekeeping_cpus().
>>
>> Signed-off-by: Nitesh Narayan Lal 
> The driver changes are straightforward, but this isn't the only driver
> with this issue, right?

Indeed, I was hoping to modify them over time with some testing.

>   I'm sure ixgbe and ice both have this problem
> too, you should fix them as well, at a minimum, and probably other
> vendors drivers:

Sure, I can at least include ixgbe and ice in the next posting if that makes
sense.
The reason I skipped them is that I was not very sure about the right way to
test these changes.

>
> $ rg -c --stats num_online_cpus drivers/net/ethernet
> ...
> 50 files contained matches
>
> for this patch i40e
> Acked-by: Jesse Brandeburg 
>
-- 
Thanks
Nitesh





Re: [RFC][Patch v1 3/3] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-10 Thread Nitesh Narayan Lal

On 9/10/20 3:22 PM, Marcelo Tosatti wrote:
> On Wed, Sep 09, 2020 at 11:08:18AM -0400, Nitesh Narayan Lal wrote:
>> This patch limits the max_vecs that the caller passes to
>> pci_alloc_irq_vectors() based on the available housekeeping CPUs, by
>> using the minimum of the two.
>>
>> A minimum of the max_vecs passed and available housekeeping CPUs is
>> derived to ensure that we don't create excess vectors which can be
>> problematic specifically in an RT environment. This is because for an RT
>> environment unwanted IRQs are moved to the housekeeping CPUs from
>> isolated CPUs to keep the latency overhead to a minimum. If the number of
>> housekeeping CPUs is significantly lower than that of the isolated CPUs,
>> we can run into failures while moving these IRQs to housekeeping due to
>> the per CPU vector limit.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  include/linux/pci.h | 16 
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 835530605c0d..750ba927d963 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -38,6 +38,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <linux/sched/isolation.h>
>>  #include 
>>  
>>  #include 
>> @@ -1797,6 +1798,21 @@ static inline int
>>  pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>>unsigned int max_vecs, unsigned int flags)
>>  {
>> +unsigned int num_housekeeping = num_housekeeping_cpus();
>> +unsigned int num_online = num_online_cpus();
>> +
>> +/*
>> + * Try to be conservative and at max only ask for the same number of
>> + * vectors as there are housekeeping CPUs. However, skip any
>> +* modification to the max vectors in two conditions:
>> + * 1. If the min_vecs requested are higher than that of the
>> + *housekeeping CPUs as we don't want to prevent the initialization
>> + *of a device.
>> + * 2. If there are no isolated CPUs as in this case the driver should
>> + *already have taken online CPUs into consideration.
>> + */
>> +if (min_vecs < num_housekeeping && num_housekeeping != num_online)
>> +max_vecs = min_t(int, max_vecs, num_housekeeping);
>>  return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
>>NULL);
>>  }
> If min_vecs > num_housekeeping, for example:
>
> /* PCI MSI/MSIx support */
> #define XGBE_MSI_BASE_COUNT 4
> #define XGBE_MSI_MIN_COUNT  (XGBE_MSI_BASE_COUNT + 1)
>
> Then the protection fails.

Right, I was ignoring that case.

>
> How about reducing max_vecs down to min_vecs, if min_vecs >
> num_housekeeping ?

Yes, I think this makes sense.
I will wait a bit to see if anyone else has any other comment and will post
the next version then.

>
-- 
Nitesh





[RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs

2020-09-09 Thread Nitesh Narayan Lal
Introduce a new API, num_housekeeping_cpus(), that can be used to retrieve
the number of housekeeping CPUs by reading an atomic variable
__num_housekeeping_cpus. This variable is set from housekeeping_setup().

This API is introduced for the purpose of drivers that were previously
relying only on num_online_cpus() to determine the number of MSIX vectors
to create. In an RT environment with many isolated CPUs but fewer
housekeeping CPUs, this was leading to a situation where an attempt to
move all of the vectors corresponding to isolated CPUs to housekeeping
CPUs was failing due to per CPU vector limit.

If there are no isolated CPUs specified then the API returns the number
of all online CPUs.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/sched/isolation.h |  7 +++
 kernel/sched/isolation.c| 23 +++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..94c25d956d8a 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -25,6 +25,7 @@ extern bool housekeeping_enabled(enum hk_flags flags);
 extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags);
 extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags);
 extern void __init housekeeping_init(void);
+extern unsigned int num_housekeeping_cpus(void);
 
 #else
 
@@ -46,6 +47,12 @@ static inline bool housekeeping_enabled(enum hk_flags flags)
 static inline void housekeeping_affine(struct task_struct *t,
   enum hk_flags flags) { }
 static inline void housekeeping_init(void) { }
+
+static unsigned int num_housekeeping_cpus(void)
+{
+   return num_online_cpus();
+}
+
 #endif /* CONFIG_CPU_ISOLATION */
 
 static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..7024298390b7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -13,6 +13,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
 EXPORT_SYMBOL_GPL(housekeeping_overridden);
 static cpumask_var_t housekeeping_mask;
 static unsigned int housekeeping_flags;
+static atomic_t __num_housekeeping_cpus __read_mostly;
 
 bool housekeeping_enabled(enum hk_flags flags)
 {
@@ -20,6 +21,27 @@ bool housekeeping_enabled(enum hk_flags flags)
 }
 EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
+/*
+ * num_housekeeping_cpus() - Read the number of housekeeping CPUs.
+ *
+ * This function returns the number of available housekeeping CPUs
+ * based on __num_housekeeping_cpus which is of type atomic_t
+ * and is initialized at the time of the housekeeping setup.
+ */
+unsigned int num_housekeeping_cpus(void)
+{
+   unsigned int cpus;
+
+   if (static_branch_unlikely(&housekeeping_overridden)) {
+   cpus = atomic_read(&__num_housekeeping_cpus);
+   /* We should always have at least one housekeeping CPU */
+   BUG_ON(!cpus);
+   return cpus;
+   }
+   return num_online_cpus();
+}
+EXPORT_SYMBOL_GPL(num_housekeeping_cpus);
+
 int housekeeping_any_cpu(enum hk_flags flags)
 {
int cpu;
@@ -131,6 +153,7 @@ static int __init housekeeping_setup(char *str, enum 
hk_flags flags)
 
housekeeping_flags |= flags;
 
+   atomic_set(&__num_housekeeping_cpus, cpumask_weight(housekeeping_mask));
free_bootmem_cpumask_var(non_housekeeping_mask);
 
return 1;
-- 
2.27.0



[RFC][Patch v1 2/3] i40e: limit msix vectors based on housekeeping CPUs

2020-09-09 Thread Nitesh Narayan Lal
In a realtime environment, it is essential to isolate unwanted IRQs from
isolated CPUs to prevent latency overheads. Creating MSIX vectors only
based on the online CPUs could lead to a potential issue on an RT setup
that has several isolated CPUs but very few housekeeping CPUs. This is
because in these kinds of setups an attempt to move the IRQs to the
limited housekeeping CPUs from isolated CPUs might fail due to the per
CPU vector limit. This could eventually result in latency spikes because
of the IRQ threads that we fail to move from isolated CPUs.

This patch restricts i40e to adding vectors based only on the available
housekeeping CPUs, by using num_housekeeping_cpus().

Signed-off-by: Nitesh Narayan Lal 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2e433fdbf2c3..3b4cd4b3de85 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 
 /* Local includes */
@@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 * will use any remaining vectors to reach as close as we can to the
 * number of online CPUs.
 */
-   cpus = num_online_cpus();
+   cpus = num_housekeeping_cpus();
pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
 
-- 
2.27.0



[RFC][Patch v1 3/3] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

2020-09-09 Thread Nitesh Narayan Lal
This patch limits the max_vecs that the caller passes to
pci_alloc_irq_vectors() based on the available housekeeping CPUs, by
using the minimum of the two.

A minimum of the max_vecs passed and available housekeeping CPUs is
derived to ensure that we don't create excess vectors which can be
problematic specifically in an RT environment. This is because for an RT
environment unwanted IRQs are moved to the housekeeping CPUs from
isolated CPUs to keep the latency overhead to a minimum. If the number of
housekeeping CPUs is significantly lower than that of the isolated CPUs,
we can run into failures while moving these IRQs to housekeeping due to
the per CPU vector limit.

Signed-off-by: Nitesh Narayan Lal 
---
 include/linux/pci.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..750ba927d963 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 
 #include 
@@ -1797,6 +1798,21 @@ static inline int
 pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
  unsigned int max_vecs, unsigned int flags)
 {
+   unsigned int num_housekeeping = num_housekeeping_cpus();
+   unsigned int num_online = num_online_cpus();
+
+   /*
+* Try to be conservative and at max only ask for the same number of
+* vectors as there are housekeeping CPUs. However, skip any
+* modification to the max vectors in two conditions:
+* 1. If the min_vecs requested are higher than that of the
+*housekeeping CPUs as we don't want to prevent the initialization
+*of a device.
+* 2. If there are no isolated CPUs as in this case the driver should
+*already have taken online CPUs into consideration.
+*/
+   if (min_vecs < num_housekeeping && num_housekeeping != num_online)
+   max_vecs = min_t(int, max_vecs, num_housekeeping);
return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
  NULL);
 }
-- 
2.27.0



[RFC] [PATCH v1 0/3] isolation: limit msix vectors based on housekeeping CPUs

2020-09-09 Thread Nitesh Narayan Lal
This is a follow-up posting for "[v1] i40e: limit the msix vectors based on
housekeeping CPUs" [1] (It took longer than expected for me to get back to
this).


Issue
=
With the current implementation, device drivers take only num_online_cpus()
into consideration when creating their MSIX vectors. That works quite well
in a non-RT environment, but in an RT environment that has a large number of
isolated CPUs and very few housekeeping CPUs it can lead to a problem.
The problem is triggered when something like tuned tries to move all
the IRQs from the isolated CPUs to the limited number of housekeeping CPUs to
prevent interruptions for a latency-sensitive workload running on
the isolated CPUs. This failure is caused by the per CPU vector
limitation.


Proposed Fix

In this patch-set, the following changes are proposed:
- A generic API, num_housekeeping_cpus(), which is meant to return the available
  housekeeping CPUs in an environment with isolated CPUs, and all online CPUs
  otherwise.
- i40e: Specifically for the i40e driver, the num_online_cpus() used in
  i40e_init_msix() to calculate the number of MSIX vectors is replaced with the
  above-defined API. This is done to restrict the number of MSIX vectors for
  i40e in RT environments.
- pci_alloc_irq_vectors(): With the help of num_housekeeping_cpus(), the
  max_vecs passed to pci_alloc_irq_vectors() is restricted to the available
  housekeeping CPUs, but only in an environment that has isolated CPUs. However,
  if min_vecs exceeds num_housekeeping_cpus(), no change is made, to make sure
  that a device initialization is not prevented due to a lack of housekeeping
  CPUs.



Reproducing the Issue
=====================
I have triggered this issue on a setup that had a total of 72 cores, among
which 68 were isolated and only 4 were left for housekeeping tasks. I was
using tuned's realtime-virtual-host profile to configure the system. In this
scenario, tuned reported the error message 'Failed to set SMP affinity of IRQ
xxx to '0040,0010,0005': [Errno 28] No space left on the device'
for several IRQs in tuned.log, due to the per-CPU vector limit.


Testing
=======
Functionality:
- To test that the issue is resolved with the i40e change, I added a tracepoint
  in i40e_init_msix() to find the number of CPUs used for vector creation,
  with and without tuned's realtime-virtual-host profile. As expected, with
  the profile applied I was only getting the number of housekeeping CPUs,
  and all available CPUs without it.

Performance:
- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against those of a vanilla
  kernel (5.9.0-rc3).

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs, and I analyzed the average bitrate for server-client TCP and UDP
    transmission via iperf.
  + To minimize the bitrate variation of the iperf TCP and UDP stream tests, I
    applied tuned's network-throughput profile and disabled HT.

  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and 24 streams (same as the number of online
    CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (same as the number of
    housekeeping CPUs) and 24 streams (same as the number of online CPUs).

 Results:
  # UDP Stream Test:
    + There was no degradation observed in the UDP stream tests in either
      environment (with isolated CPUs and without isolated CPUs, after the
      introduction of the patches).
  # TCP Stream Test - No isolated CPUs:
    + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
    + Multiple Streams (4)  - Average degradation of around 5-6%
    + Multiple Streams (24) - Average degradation of around 2-3%
    + Single Stream         - Even on a vanilla kernel, the bitrate observed
                              for a TCP single stream test seems to vary
                              significantly across different runs (e.g. the %
                              variation between the best and the worst case on
                              a vanilla kernel was around 8-10%). A similar
                              variation was observed with the kernel that
                              included my patches. No additional degradation
                              was observed.

Since the change specifically for pci_alloc_irq_vectors is going to impact
several drivers, I have posted this patch-set as an RFC. I would be happy to
perform more testing based on any suggestions, or to incorporate any comments,
to ensure that the change does not break anything.

[1] https://lore.kernel.org/patchwork/patch/1256308/ 

Nitesh Narayan Lal (3):
  sched/isolation: API to get num of hosekeeping CPUs
  i40e: limit msix vectors based on housekeeping CPUs

Re: WARNING: suspicious RCU usage - while installing a VM on a CPU listed under nohz_full

2020-07-30 Thread Nitesh Narayan Lal

On 7/29/20 8:34 AM, Nitesh Narayan Lal wrote:
> On 7/28/20 10:38 PM, Wanpeng Li wrote:
>> Hi Nitesh,
>> On Wed, 29 Jul 2020 at 09:00, Wanpeng Li  wrote:
>>> On Tue, 28 Jul 2020 at 22:40, Nitesh Narayan Lal  wrote:
>>>> Hi,
>>>>
>>>> I have recently come across an RCU trace with the 5.8-rc7 kernel that has 
>>>> the
>>>> debug configs enabled while installing a VM on a CPU that is listed under
>>>> nohz_full.
>>>>
>>>> Based on some of the initial debugging, my impression is that the issue is
>>>> triggered because of the fastpath that is meant to optimize the writes to 
>>>> x2APIC
>>>> ICR that eventually leads to a virtual IPI in fixed delivery mode, is 
>>>> getting
>>>> invoked from the quiescent state.
>> Could you try latest linux-next tree? I guess maybe some patches are
>> pending in linux-next tree, I can't reproduce against linux-next tree.
> Sure, I will try this today.

Hi Wanpeng,

I am not seeing the issue getting reproduced with the linux-next tree.
However, I am still seeing a warning stack trace:

[  139.220080] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xb57/0x1320 [kvm]
[  139.226837] Code: e8 03 0f b6 04 18 84 c0 74 06 0f 8e 4a 03 00 00 41 c6 85 48
31 00 00 00 e9 24 f8 ff ff 4c 89 ef e8 7e ac 02 00 e9 3d f8 ff ff <0f> 0b e9 f2
f8 ff ff 48f
[  139.247828] RSP: 0018:8889bc397cb8 EFLAGS: 00010202
[  139.253700] RAX: 0001 RBX: dc00 RCX: c1fc3bef
[  139.261695] RDX:  RSI: 0001 RDI: 888f0fa1a8a0
[  139.269692] RBP: 8889bc397d18 R08: ed113786a7d0 R09: ed113786a7d0
[  139.277686] R10: 8889bc353e7f R11: ed113786a7cf R12: 8889bc35423c
[  139.285682] R13: 8889bc353e40 R14: 8889bc353e6c R15: 88897f536000
[  139.293678] FS:  7f3d8a71c700() GS:888a3c40()
knlGS:
[  139.302742] CS:  0010 DS:  ES:  CR0: 80050033
[  139.309186] CR2:  CR3: 0009bc34c004 CR4: 003726e0
[  139.317180] Call Trace:
[  139.320002]  kvm_vcpu_ioctl+0x3ee/0xb10 [kvm]
[  139.324907]  ? sched_clock+0x5/0x10
[  139.328875]  ? kvm_io_bus_get_dev+0x1c0/0x1c0 [kvm]
[  139.334375]  ? ioctl_file_clone+0x120/0x120
[  139.339079]  ? selinux_file_ioctl+0x98/0x570
[  139.343895]  ? selinux_file_mprotect+0x5b0/0x5b0
[  139.349088]  ? irq_matrix_assign+0x360/0x430
[  139.353904]  ? rcu_read_lock_sched_held+0xe0/0xe0
[  139.359201]  ? __fget_files+0x1f0/0x300
[  139.363532]  __x64_sys_ioctl+0x128/0x18e
[  139.367948]  do_syscall_64+0x33/0x40
[  139.371974]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  139.377643] RIP: 0033:0x7f3d98d0a88b

Are you also triggering anything like this in your environment?


>
-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: WARNING: suspicious RCU usage - while installing a VM on a CPU listed under nohz_full

2020-07-29 Thread Nitesh Narayan Lal

On 7/28/20 10:38 PM, Wanpeng Li wrote:
> Hi Nitesh,
> On Wed, 29 Jul 2020 at 09:00, Wanpeng Li  wrote:
>> On Tue, 28 Jul 2020 at 22:40, Nitesh Narayan Lal  wrote:
>>> Hi,
>>>
>>> I have recently come across an RCU trace with the 5.8-rc7 kernel that has 
>>> the
>>> debug configs enabled while installing a VM on a CPU that is listed under
>>> nohz_full.
>>>
>>> Based on some of the initial debugging, my impression is that the issue is
>>> triggered because of the fastpath that is meant to optimize the writes to 
>>> x2APIC
>>> ICR that eventually leads to a virtual IPI in fixed delivery mode, is 
>>> getting
>>> invoked from the quiescent state.
> Could you try latest linux-next tree? I guess maybe some patches are
> pending in linux-next tree, I can't reproduce against linux-next tree.

Sure, I will try this today.

>
-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


WARNING: suspicious RCU usage - while installing a VM on a CPU listed under nohz_full

2020-07-28 Thread Nitesh Narayan Lal
Hi,

I have recently come across an RCU trace with the 5.8-rc7 kernel that has the
debug configs enabled while installing a VM on a CPU that is listed under
nohz_full.

Based on some of the initial debugging, my impression is that the issue is
triggered because of the fastpath that is meant to optimize the writes to x2APIC
ICR that eventually leads to a virtual IPI in fixed delivery mode, is getting
invoked from the quiescent state.

Following is the RCU trace dump that I was getting:

[  178.109535] =
[  178.114027] WARNING: suspicious RCU usage
[  178.118518] 5.8.0-rc7-upstream-+ #10 Not tainted
[  178.123685] -
[  178.128176] arch/x86/kvm/lapic.c:269 suspicious rcu_dereference_check() 
usage!
[  178.136254]
[  178.136254] other info that might help us debug this:
[  178.136254]
[  178.145205]
[  178.145205] rcu_scheduler_active = 2, debug_locks = 1
[  178.152506] 1 lock held by CPU 0/KVM/2959:
[  178.157091]  #0: c9000717b6f8 (&kvm->arch.apic_map_lock){+.+.}-{3:3}, at:
kvm_recalculate_apic_map+0x8b/0xdd0 [kvm]
[  178.169207]
[  178.169207] stack backtrace:
[  178.174086] CPU: 18 PID: 2959 Comm: CPU 0/KVM Not tainted
5.8.0-rc7-upstream-+ #10
[  178.182539] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.6.0 
10/31/2017
[  178.190895] Call Trace:
[  178.193637]  dump_stack+0x9d/0xe0
[  178.197379]  kvm_recalculate_apic_map+0x8ce/0xdd0 [kvm]
[  178.203259]  ? kvm_lapic_reset+0x832/0xe50 [kvm]
[  178.208459]  kvm_vcpu_reset+0x28/0x7b0 [kvm]
[  178.213270]  kvm_arch_vcpu_create+0x830/0xb70 [kvm]
[  178.218759]  kvm_vm_ioctl+0x11b1/0x1fe0 [kvm]
[  178.223635]  ? mark_lock+0x144/0x19e0
[  178.227757]  ? kvm_unregister_device_ops+0xe0/0xe0 [kvm]
[  178.233698]  ? sched_clock+0x5/0x10
[  178.237597]  ? sched_clock_cpu+0x18/0x1d0
[  178.242087]  ? __lock_acquire+0xcf6/0x5010
[  178.246686]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[  178.252429]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[  178.258177]  ? sched_clock+0x5/0x10
[  178.262074]  ? sched_clock_cpu+0x18/0x1d0
[  178.266556]  ? find_held_lock+0x3a/0x1c0
[  178.270953]  ? ioctl_file_clone+0x120/0x120
[  178.275630]  ? selinux_file_ioctl+0x98/0x570
[  178.280405]  ? selinux_file_mprotect+0x5b0/0x5b0
[  178.285569]  ? rcu_tasks_wait_gp+0x6d1/0xa50
[  178.290342]  ? rcu_read_lock_sched_held+0xe0/0xe0
[  178.295608]  ? __fget_files+0x1f0/0x300
[  178.299912]  ksys_ioctl+0xc0/0x110
[  178.303719]  __x64_sys_ioctl+0x6f/0xb0
[  178.307913]  do_syscall_64+0x51/0xb0
[  178.311913]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.317557] RIP: 0033:0x7f6b9700d88b
[  178.321551] Code: 0f 1e fa 48 8b 05 fd 95 2c 00 64 c7 00 26 00 00 00 48 c7 c0
ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0
ff ff 73 018
[  178.342513] RSP: 002b:7f6b8cbe9668 EFLAGS: 0246 ORIG_RAX:
0010
[  178.350967] RAX: ffda RBX: 55e8162d9000 RCX: 7f6b9700d88b
[  178.358935] RDX:  RSI: ae41 RDI: 000e
[  178.366903] RBP: 55e8162d9000 R08: 55e8155ec4d0 R09: 55e8162d9000
[  178.374871] R10: 55e815d94ee0 R11: 0246 R12: 55e8162ad420
[  178.382838] R13: 55e8162d9000 R14: 7ffedf043660 R15: 7f6b8cbe9800
[  182.771858]
[  182.773606] =
[  182.778084] WARNING: suspicious RCU usage
[  182.782564] 5.8.0-rc7-upstream-+ #10 Not tainted
[  182.787719] -
[  182.792197] arch/x86/include/asm/trace/fpu.h:60 suspicious
rcu_dereference_check() usage!
[  182.801329]
[  182.801329] other info that might help us debug this:
[  182.801329]
[  182.810268]
[  182.810268] RCU used illegally from idle CPU!
[  182.810268] rcu_scheduler_active = 2, debug_locks = 1
[  182.822407] RCU used illegally from extended quiescent state!
[  182.828824] 1 lock held by CPU 0/KVM/2959:
[  182.833397]  #0: 88903f8500d0 (&vcpu->mutex){+.+.}-{3:3}, at:
kvm_vcpu_ioctl+0x172/0xb00 [kvm]
[  182.838308]
[  182.838308] stack backtrace:
[  182.838313] CPU: 18 PID: 2959 Comm: CPU 0/KVM Tainted: G    W
5.8.0-rc7-upstream-+ #10
[  182.838316] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.6.0 
10/31/2017
[  182.838318] Call Trace:
[  182.874318]  dump_stack+0x9d/0xe0
[  182.878024]  switch_fpu_return+0x37c/0x410
[  182.882602]  ? fpu__clear+0x1a0/0x1a0
[  182.886700]  ? rcu_dynticks_eqs_enter+0x15/0x30
[  182.891807]  vcpu_enter_guest+0x1854/0x3df0 [kvm]
[  182.897121]  ? kvm_vcpu_reload_apic_access_page+0x60/0x60 [kvm]
[  182.903738]  ? lock_acquire+0x1ac/0xac0
[  182.908062]  ? kvm_arch_vcpu_ioctl_run+0x1dc/0x13c0 [kvm]
[  182.914107]  ? rcu_read_unlock+0x50/0x50
[  182.918489]  ? rcu_read_lock_sched_held+0xaf/0xe0
[  182.923788]  ? kvm_load_guest_fpu+0x94/0x350 [kvm]
[  182.929177]  ? kvm_load_guest_fpu+0x94/0x350 [kvm]
[  182.934528]  ? __local_bh_enable_ip+0x123/0x1a0
[  182.939635]  kvm_arch_vcpu_ioctl_run+0x310/0x13c0 [kvm]
[  182.945529]  kvm_vcpu

Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-30 Thread Nitesh Narayan Lal

On 6/30/20 8:32 PM, Andrew Morton wrote:
> On Mon, 29 Jun 2020 12:11:25 -0400 Nitesh Narayan Lal  
> wrote:
>
>> On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
>>> From: Alex Belits 
>>>
>>> The current implementation of cpumask_local_spread() does not respect the
>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>> it will return it to the caller for pinning of its IRQ threads. Having
>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>> overhead.
>>>
>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>> available housekeeping CPUs.
>>>
>>> Signed-off-by: Alex Belits 
>>> Signed-off-by: Nitesh Narayan Lal 
>> Hi Peter,
>>
>> I just realized that Yuqi jin's patch [1] that modifies cpumask_local_spread 
>> is
>> lying in linux-next.
>> Should I do a re-post by re-basing the patches on the top of linux-next?
>>
>> [1]
>> https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshao...@hisilicon.com/
> This patch has had some review difficulties and has been pending for
> quite some time.  I suggest you base your work on mainline and that we
> ask Yuqi jin to rebase on that, if I don't feel confident doing it,
>

I see, in that case, it should be fine to go ahead with this patch-set as I
already based this on top of the latest master before posting.

-- 
Thanks
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-29 Thread Nitesh Narayan Lal

On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
> From: Alex Belits 
>
> The current implementation of cpumask_local_spread() does not respect the
> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> it will return it to the caller for pinning of its IRQ threads. Having
> these unwanted IRQ threads on an isolated CPU adds up to a latency
> overhead.
>
> Restrict the CPUs that are returned for spreading IRQs only to the
> available housekeeping CPUs.
>
> Signed-off-by: Alex Belits 
> Signed-off-by: Nitesh Narayan Lal 

Hi Peter,

I just realized that Yuqi jin's patch [1] that modifies cpumask_local_spread is
lying in linux-next.
Should I do a re-post by re-basing the patches on the top of linux-next?

[1]
https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshao...@hisilicon.com/

> ---
>  lib/cpumask.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index fb22fb266f93..85da6ab4fbb5 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * cpumask_next - get the next cpu in a cpumask
> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>   */
>  unsigned int cpumask_local_spread(unsigned int i, int node)
>  {
> - int cpu;
> + int cpu, hk_flags;
> + const struct cpumask *mask;
>  
> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> + mask = housekeeping_cpumask(hk_flags);
>   /* Wrap: we always want a cpu. */
> - i %= num_online_cpus();
> + i %= cpumask_weight(mask);
>  
>   if (node == NUMA_NO_NODE) {
> - for_each_cpu(cpu, cpu_online_mask)
> + for_each_cpu(cpu, mask) {
>   if (i-- == 0)
>   return cpu;
> + }
>   } else {
>   /* NUMA first. */
> - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>   if (i-- == 0)
>   return cpu;
> + }
>  
> - for_each_cpu(cpu, cpu_online_mask) {
> + for_each_cpu(cpu, mask) {
>   /* Skip NUMA nodes, done above. */
>   if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>   continue;
-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs

2020-06-26 Thread Nitesh Narayan Lal

On 6/16/20 4:03 AM, Christoph Hellwig wrote:
> On Mon, Jun 15, 2020 at 04:21:25PM -0400, Nitesh Narayan Lal wrote:
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +mask = housekeeping_cpumask(hk_flags);
>> +cpus = cpumask_weight(mask);
> Code like this has no business inside a driver.  Please provide a
> proper core API for it instead.  Also please wire up
> pci_alloc_irq_vectors* to use this API as well.
>

Hi Christoph,

I have been looking into using the nr_housekeeping_* API that I will be
defining within pci_alloc_irq_vectors* to limit the number of vectors.
However, I am wondering about a few things:

- Some of the drivers, such as i40e, have until now used the number of online
  CPUs to restrict the number of vectors that they create. Will it make sense
  if I restrict the maximum vectors requested based on the number of
  online/housekeeping CPUs (though I will have to make sure that min_vecs is
  always satisfied)?

  The other option would be to check the total available vectors across all
  online/housekeeping CPUs to limit max_vecs; this would probably be more
  accurate?

- Another thing that I am wondering about is the right way to test this change.

Please let me know if you have any suggestions?

-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


[PATCH v4 0/3] Preventing job distribution to isolated CPUs

2020-06-25 Thread Nitesh Narayan Lal
This patch-set originated from one of the patches that was posted earlier as
part of the "Task_isolation" mode [1] patch series by Alex Belits. There are
only a couple of changes that I am proposing in this patch-set compared to
what Alex posted earlier.
 
 
Context
===
On a broad level, all three patches included in this patch
set are meant to improve the drivers/libraries so that they
respect isolated CPUs by not pinning any job on them. Not
doing so could impact the latency values in RT use-cases.


Patches
===
* Patch1:
  The first patch is meant to make cpumask_local_spread()
  aware of the isolated CPUs. It ensures that the CPUs that
  are returned by this API only includes housekeeping CPUs.

* Patch2:
  This patch ensures that a probe function that is called
  using work_on_cpu() doesn't run any task on an isolated CPU.

* Patch3:
  This patch makes store_rps_map() aware of the isolated
  CPUs so that RPS doesn't queue any jobs on an isolated CPU.


Proposed Changes

To fix the above-mentioned issues Alex has used housekeeping_cpumask().
The only changes that I am proposing here are:
- Removing the dependency on CONFIG_TASK_ISOLATION that was proposed by
  Alex, as it should be safe to rely on housekeeping_cpumask()
  even when we don't have any isolated CPUs and we want
  to fall back to using all available CPUs in any of the above scenarios.
- Using both HK_FLAG_DOMAIN and HK_FLAG_WQ in Patch2 & 3, this is
  because we would want the above fixes not only when we have isolcpus but
  also with something like systemd's CPU affinity.


Testing
===
* Patch 1:
  Fix for cpumask_local_spread() is tested by creating VFs, loading
  iavf module and by adding a tracepoint to confirm that only housekeeping
  CPUs are picked when an appropriate profile is set up and all remaining
  CPUs when no CPU isolation is configured.

* Patch 2:
  To test the PCI fix, I hotplugged a virtio-net-pci device from the qemu
  console and forced its addition to a specific node to trigger the code path
  that includes the proposed fix, and verified via tracepoint that only
  housekeeping CPUs are included.

* Patch 3:
  To test the fix in store_rps_map(), I tried configuring an isolated
  CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus, which
  resulted in a 'write error: Invalid argument' error. For the case
  where a non-isolated CPU is written to rps_cpus, the above operation
  succeeded without any error.


Changes from v3[2]:
==
- In patch 1, replaced HK_FLAG_WQ with HK_FLAG_MANAGED_IRQ based on the
  suggestion from Frederic Weisbecker.

Changes from v2[3]:
==
Both the following suggestions are from Peter Zijlstra.
- Patch1: Removed the extra while loop from cpumask_local_spread and fixed
  the code styling issues.
- Patch3: Change to use cpumask_empty() for verifying that the requested
  CPUs are available in the housekeeping CPUs.

Changes from v1[4]:
==
- Included the suggestions made by Bjorn Helgaas in the commit message.
- Included the 'Reviewed-by' and 'Acked-by' received for Patch-2.


[1] 
https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/
[2] 
https://patchwork.ozlabs.org/project/linux-pci/cover/20200623192331.215557-1-nit...@redhat.com/
[3] 
https://patchwork.ozlabs.org/project/linux-pci/cover/20200622234510.240834-1-nit...@redhat.com/
[4] 
https://patchwork.ozlabs.org/project/linux-pci/cover/20200610161226.424337-1-nit...@redhat.com/


Alex Belits (3):
  lib: Restrict cpumask_local_spread to houskeeping CPUs
  PCI: Restrict probe functions to housekeeping CPUs
  net: Restrict receive packets queuing to housekeeping CPUs

 drivers/pci/pci-driver.c |  5 ++++-
 lib/cpumask.c            | 16 +++++++++++-----
 net/core/net-sysfs.c     | 10 +++++++++-
 3 files changed, 24 insertions(+), 7 deletions(-)

-- 



[Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs

2020-06-25 Thread Nitesh Narayan Lal
From: Alex Belits 

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Replace the cpumask used in pci_call_probe() from all online CPUs to only
housekeeping CPUs. This is to ensure that there are no additional latency
overheads caused due to the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Acked-by: Bjorn Helgaas 
Reviewed-by: Frederic Weisbecker 
---
 drivers/pci/pci-driver.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510af1221..449466f71040 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
  const struct pci_device_id *id)
 {
int error, node, cpu;
+   int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
struct drv_dev_and_id ddi = { drv, dev, id };
 
/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
pci_physfn_is_probed(dev))
cpu = nr_cpu_ids;
else
-   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+   cpu = cpumask_any_and(cpumask_of_node(node),
+ housekeeping_cpumask(hk_flags));
 
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
-- 
2.18.4



[Patch v4 3/3] net: Restrict receive packets queuing to housekeeping CPUs

2020-06-25 Thread Nitesh Narayan Lal
From: Alex Belits 

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 net/core/net-sysfs.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b822bb15..677868fea316 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
struct rps_map *old_map, *map;
cpumask_var_t mask;
-   int err, cpu, i;
+   int err, cpu, i, hk_flags;
static DEFINE_MUTEX(rps_map_mutex);
 
if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err;
}
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+   if (cpumask_empty(mask)) {
+   free_cpumask_var(mask);
+   return -EINVAL;
+   }
+
map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
  GFP_KERNEL);
-- 
2.18.4



[Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-25 Thread Nitesh Narayan Lal
From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect the
isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
it will return it to the caller for pinning of its IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds up to a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 lib/cpumask.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+   mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
 
if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
 
-   for_each_cpu(cpu, cpu_online_mask) {
+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;
-- 
2.18.4



Re: [Patch v3 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-24 Thread Nitesh Narayan Lal

On 6/24/20 3:26 PM, Andrew Morton wrote:
> On Tue, 23 Jun 2020 15:23:29 -0400 Nitesh Narayan Lal  
> wrote:
>
>> From: Alex Belits 
>>
>> The current implementation of cpumask_local_spread() does not respect the
>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>> it will return it to the caller for pinning of its IRQ threads. Having
>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>> overhead.
>>
>> Restrict the CPUs that are returned for spreading IRQs only to the
>> available housekeeping CPUs.
>>
>> ...
>>
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
>> mask)
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>>  {
>> -int cpu;
>> +int cpu, hk_flags;
>> +const struct cpumask *mask;
>>  
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +mask = housekeeping_cpumask(hk_flags);
>>  /* Wrap: we always want a cpu. */
>> -i %= num_online_cpus();
>> +i %= cpumask_weight(mask);
>>  
>>  if (node == NUMA_NO_NODE) {
>> -for_each_cpu(cpu, cpu_online_mask)
>> +for_each_cpu(cpu, mask) {
>>  if (i-- == 0)
>>  return cpu;
>> +}
>>  } else {
>>  /* NUMA first. */
>> -for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>> +for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>  if (i-- == 0)
>>  return cpu;
>> +}
>>  
>> -for_each_cpu(cpu, cpu_online_mask) {
>> +for_each_cpu(cpu, mask) {
>>  /* Skip NUMA nodes, done above. */
>>  if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>  continue;
> Are you aware of these changes to cpu_local_spread()?
> https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshao...@hisilicon.com/
>
> I don't see a lot of overlap but it would be nice for you folks to
> check each other's homework ;)

I took a look at the patch and, as you said, there is not much overlap.
The idea of keeping isolated CPUs untouched for RT environments is valid
for the optimization that Shaokun is suggesting as well.
I am not sure about the current state of that patch-set, but I will certainly
keep an eye on it.

>
>
-- 
Thanks
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [Patch v3 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-24 Thread Nitesh Narayan Lal

On 6/24/20 3:26 PM, Andrew Morton wrote:
> On Tue, 23 Jun 2020 15:23:29 -0400 Nitesh Narayan Lal  
> wrote:
>
>> From: Alex Belits 
>>
>> The current implementation of cpumask_local_spread() does not respect the
>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>> it will return it to the caller for pinning of its IRQ threads. Having
>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>> overhead.
>>
>> Restrict the CPUs that are returned for spreading IRQs only to the
>> available housekeeping CPUs.
>>
>> ...
>>
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
>> mask)
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>>  {
>> -int cpu;
>> +int cpu, hk_flags;
>> +const struct cpumask *mask;
>>  
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +mask = housekeeping_cpumask(hk_flags);
>>  /* Wrap: we always want a cpu. */
>> -i %= num_online_cpus();
>> +i %= cpumask_weight(mask);
>>  
>>  if (node == NUMA_NO_NODE) {
>> -for_each_cpu(cpu, cpu_online_mask)
>> +for_each_cpu(cpu, mask) {
>>  if (i-- == 0)
>>  return cpu;
>> +}
>>  } else {
>>  /* NUMA first. */
>> -for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>> +for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>  if (i-- == 0)
>>  return cpu;
>> +}
>>  
>> -for_each_cpu(cpu, cpu_online_mask) {
>> +for_each_cpu(cpu, mask) {
>>  /* Skip NUMA nodes, done above. */
>>  if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>  continue;
> Are you aware of these changes to cpu_local_spread()?
> https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshao...@hisilicon.com/
>
> I don't see a lot of overlap but it would be nice for you folks to
> check each other's homework ;)

Sure, I will take a look.
Thanks

>
-- 
Nitesh



signature.asc
Description: OpenPGP digital signature


Re: [Patch v3 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-06-24 Thread Nitesh Narayan Lal

On 6/24/20 8:13 AM, Frederic Weisbecker wrote:
> On Tue, Jun 23, 2020 at 03:23:29PM -0400, Nitesh Narayan Lal wrote:
>> From: Alex Belits 
>>
>> The current implementation of cpumask_local_spread() does not respect the
>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>> it will return it to the caller for pinning of its IRQ threads. Having
>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>> overhead.
>>
>> Restrict the CPUs that are returned for spreading IRQs only to the
>> available housekeeping CPUs.
>>
>> Signed-off-by: Alex Belits 
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  lib/cpumask.c | 16 +++-
>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index fb22fb266f93..d73104995981 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
>> mask)
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>>  {
>> -int cpu;
>> +int cpu, hk_flags;
>> +const struct cpumask *mask;
>>  
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
> This should be HK_FLAG_MANAGED_IRQ instead of HK_FLAG_WQ since this
> function seems to be used mostly to select CPUs to affine managed IRQs.

IIRC, there are drivers such as ixgbe that use cpumask_local_spread() while
affining NORMAL IRQs as well.
But I can recheck that.

> In the end the cpumask you pass to IRQ core will be filtered throughout
> HK_FLAG_MANAGED_IRQ anyway so better select an appropriate one in the
> first place to avoid an empty cpumask intersection.
>
> Now even if cpumask_local_spread() is currently mostly used to select
> managed irq targets, the name and role of the function don't refer to that.
> Probably cpumask_local_spread() should take HK_ flag in parameter so that
> it can correctly handle future users?
>
> That being said, I plan to merge HK_FLAG_RCU, HK_FLAG_MISC, HK_FLAG_SCHED,
> HK_FLAG_WQ and HK_FLAG_TIMER into HK_FLAG_UNBOUND since it doesn't make sense
> to divide them all.

That would be nice.

>  And the actual flag used inside cpumask_local_spread()
> could end up being HK_FLAG_DOMAIN | HK_FLAG_UNBOUND. So probably you don't
> need to worry about that and just change the HK_FLAG_WQ in your patch
> with HK_FLAG_MANAGED_IRQ.
>
> Thanks.
>
-- 
Thanks
Nitesh





[Patch v3 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2020-06-23 Thread Nitesh Narayan Lal
From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time task,
it will still be returned to the caller for pinning of IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds a latency overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 lib/cpumask.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..d73104995981 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
 
if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
 
-   for_each_cpu(cpu, cpu_online_mask) {
+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;
-- 
2.18.4



[Patch v3 3/3] net: Restrict receive packets queuing to housekeeping CPUs

2020-06-23 Thread Nitesh Narayan Lal
From: Alex Belits 

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 net/core/net-sysfs.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b822bb15..677868fea316 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
struct rps_map *old_map, *map;
cpumask_var_t mask;
-   int err, cpu, i;
+   int err, cpu, i, hk_flags;
static DEFINE_MUTEX(rps_map_mutex);
 
if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err;
}
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+   if (cpumask_empty(mask)) {
+   free_cpumask_var(mask);
+   return -EINVAL;
+   }
+
map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
  GFP_KERNEL);
-- 
2.18.4



[Patch v3 2/3] PCI: Restrict probe functions to housekeeping CPUs

2020-06-23 Thread Nitesh Narayan Lal
From: Alex Belits 

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Replace the cpumask used in pci_call_probe() from all online CPUs to only
housekeeping CPUs. This is to ensure that there are no additional latency
overheads caused due to the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Acked-by: Bjorn Helgaas 
Reviewed-by: Frederic Weisbecker 
---
 drivers/pci/pci-driver.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510af1221..449466f71040 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
  const struct pci_device_id *id)
 {
int error, node, cpu;
+   int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
struct drv_dev_and_id ddi = { drv, dev, id };
 
/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
pci_physfn_is_probed(dev))
cpu = nr_cpu_ids;
else
-   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+   cpu = cpumask_any_and(cpumask_of_node(node),
+ housekeeping_cpumask(hk_flags));
 
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
-- 
2.18.4



[PATCH v3 0/3] Preventing job distribution to isolated CPUs

2020-06-23 Thread Nitesh Narayan Lal
This patch-set originated from one of the patches that have been
posted earlier as a part of "Task_isolation" mode [1] patch series
by Alex Belits . There are only a couple of
changes that I am proposing in this patch-set compared to what Alex
has posted earlier.
 
 
Context
===
On a broad level, all three patches that are included in this patch
set are meant to improve the drivers/libraries to respect isolated
CPUs by not pinning any job on them. Not doing so could impact
the latency values in RT use-cases.


Patches
===
* Patch1:
  The first patch is meant to make cpumask_local_spread()
  aware of the isolated CPUs. It ensures that the CPUs that
  are returned by this API include only housekeeping CPUs.

* Patch2:
  This patch ensures that a probe function that is called
  using work_on_cpu() doesn't run any task on an isolated CPU.

* Patch3:
  This patch makes store_rps_map() aware of the isolated
  CPUs so that RPS doesn't queue any jobs on an isolated CPU.


Proposed Changes
===
To fix the above-mentioned issues Alex has used housekeeping_cpumask().
The only changes that I am proposing here are:
- Removing the dependency on CONFIG_TASK_ISOLATION that was proposed by
  Alex, as it should be safe to rely on housekeeping_cpumask()
  even when we don't have any isolated CPUs and we want
  to fall back to using all available CPUs in any of the above scenarios.
- Using both HK_FLAG_DOMAIN and HK_FLAG_WQ in all three patches, because
  we want the above fixes not only when we have isolcpus but also with
  something like systemd's CPU affinity.


Testing
===
* Patch 1:
  Fix for cpumask_local_spread() is tested by creating VFs, loading
  iavf module and by adding a tracepoint to confirm that only housekeeping
  CPUs are picked when an appropriate profile is set up and all remaining
  CPUs when no CPU isolation is configured.

* Patch 2:
  To test the PCI fix, I hotplugged a virtio-net-pci from qemu console
  and forced its addition to a specific node to trigger the code path that
  includes the proposed fix and verified that only housekeeping CPUs
  are included via tracepoint.

* Patch 3:
  To test the fix in store_rps_map(), I tried configuring an isolated
  CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus which
  resulted in 'write error: Invalid argument' error. For the case
  where a non-isolated CPU is writing in rps_cpus the above operation
  succeeded without any error.


Changes from v2[2]:
==
- Patch1: Removed the extra while loop from cpumask_local_spread and fixed
  the code styling issues.
- Patch3: Change to use cpumask_empty() for verifying that the requested
  CPUs are available in the housekeeping CPUs.

Changes from v1[3]:
==
- Included the suggestions made by Bjorn Helgaas in the commit message.
- Included the 'Reviewed-by' and 'Acked-by' received for Patch-2.


[1] 
https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/
[2] 
https://patchwork.ozlabs.org/project/linux-pci/cover/20200622234510.240834-1-nit...@redhat.com/
[3] 
https://patchwork.ozlabs.org/project/linux-pci/cover/20200610161226.424337-1-nit...@redhat.com/


Alex Belits (3):
  lib: Restrict cpumask_local_spread to housekeeping CPUs
  PCI: Restrict probe functions to housekeeping CPUs
  net: Restrict receive packets queuing to housekeeping CPUs

 drivers/pci/pci-driver.c |  5 -
 lib/cpumask.c| 16 +++-
 net/core/net-sysfs.c | 10 +-
 3 files changed, 24 insertions(+), 7 deletions(-)

-- 



Re: [Patch v2 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2020-06-23 Thread Nitesh Narayan Lal

On 6/23/20 5:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 22, 2020 at 07:45:08PM -0400, Nitesh Narayan Lal wrote:
>> From: Alex Belits 
>>
>> The current implementation of cpumask_local_spread() does not respect the
>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>> it will return it to the caller for pinning of its IRQ threads. Having
>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>> overhead.
>>
>> Restrict the CPUs that are returned for spreading IRQs only to the
>> available housekeeping CPUs.
>>
>> Signed-off-by: Alex Belits 
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
>>  lib/cpumask.c | 43 +--
>>  1 file changed, 25 insertions(+), 18 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index fb22fb266f93..cc4311a8c079 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -205,28 +206,34 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
>> mask)
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>>  {
>> -int cpu;
>> +int cpu, m, n, hk_flags;
>> +const struct cpumask *mask;
>>  
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +mask = housekeeping_cpumask(hk_flags);
>> +m = cpumask_weight(mask);
>>  /* Wrap: we always want a cpu. */
>> -i %= num_online_cpus();
>> +n = i % m;
>> +while (m-- > 0) {
> I are confuzled. What do we need this outer loop for?
>
> Why isn't something like:
>
>   i %= cpumask_weight(mask);
>
> good enough? That avoids having to touch the test.

Makes sense.
Thanks

> Still when you're there, at the very least you can fix the horrible
> style:

Sure.

>
>
>> +if (node == NUMA_NO_NODE) {
>> +for_each_cpu(cpu, mask)
>> +if (n-- == 0)
>> +return cpu;
> { }
>
>> +} else {
>> +/* NUMA first. */
>> +for_each_cpu_and(cpu, cpumask_of_node(node), mask)
>> +if (n-- == 0)
>> +return cpu;
> { }
>
>>  
>> +for_each_cpu(cpu, mask) {
>> +/* Skip NUMA nodes, done above. */
>> +if (cpumask_test_cpu(cpu,
>> + cpumask_of_node(node)))
>> +continue;
> No linebreak please.
>
>>  
>> +if (n-- == 0)
>> +return cpu;
>> +}
>>  }
>>  }
>>  BUG();
>> -- 
>> 2.18.4
>>
-- 
Nitesh





Re: [Patch v2 3/3] net: Restrict receive packets queuing to housekeeping CPUs

2020-06-23 Thread Nitesh Narayan Lal

On 6/23/20 5:23 AM, Peter Zijlstra wrote:
> On Mon, Jun 22, 2020 at 07:45:10PM -0400, Nitesh Narayan Lal wrote:
>> @@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue 
>> *queue,
>>  return err;
>>  }
>>  
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
>> +if (cpumask_weight(mask) == 0) {
> We have cpumask_empty() for that, which is a much more efficient way of
> testing the same.

Yes, right.
I will make this change.

>
>> +free_cpumask_var(mask);
>> +return -EINVAL;
>> +}
>> +
>>  map = kzalloc(max_t(unsigned int,
>>  RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
>>GFP_KERNEL);
>> -- 
>> 2.18.4
>>
-- 
Thanks
Nitesh





Re: [PATCH v2 0/3] Preventing job distribution to isolated CPUs

2020-06-22 Thread Nitesh Narayan Lal

On 6/22/20 7:45 PM, Nitesh Narayan Lal wrote:
>
> Testing
> ===
> * Patch 1: 
>   Fix for cpumask_local_spread() is tested by creating VFs, loading
>   iavf module and by adding a tracepoint to confirm that only housekeeping 
>   CPUs are picked when an appropriate profile is set up and all remaining  
>   CPUs when no CPU isolation is configured.
>
> * Patch 2: 
>   To test the PCI fix, I hotplugged a virtio-net-pci from qemu console 
>   and forced its addition to a specific node to trigger the code path that 
>   includes the proposed fix and verified that only housekeeping CPUs   
>   are included via tracepoint. 
>
> * Patch 3: 
>   To test the fix in store_rps_map(), I tried configuring an isolated  
>   CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus which   
>   resulted in 'write error: Invalid argument' error. For the case  
>   where a non-isolated CPU is writing in rps_cpus the above operation  
>   succeeded without any error. 
>
>
> Changes from v1:   
> ===
> - Included the suggestions made by Bjorn Helgaas in the commit messages.
> - Included the 'Reviewed-by' and 'Acked-by' received for Patch-2.  
>
> [1] 
> https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/
>
> Alex Belits (3):   
>   lib: Restrict cpumask_local_spread to housekeeping CPUs
>   PCI: Restrict probe functions to housekeeping CPUs   
>   net: Restrict receive packets queuing to housekeeping CPUs   
>
>  drivers/pci/pci-driver.c |  5 -   
>  lib/cpumask.c| 43 +++-
>  net/core/net-sysfs.c | 10 +-  
>  3 files changed, 38 insertions(+), 20 deletions(-)
>
> --
>

Hi,

It seems that the cover email got messed up while I was sending the patches.
I am putting my intended cover-email below for now. I can send a v3 with a
proper cover-email if needed. The reason I am not sending it right now is
that if I get some comments on my patches, I would prefer to include them
as well in my v3 posting.


"
This patch-set originated from one of the patches that have been
posted earlier as a part of "Task_isolation" mode [1] patch series
by Alex Belits . There are only a couple of
changes that I am proposing in this patch-set compared to what Alex
has posted earlier.


Context
===
On a broad level, all three patches that are included in this patch
set are meant to improve the drivers/libraries to respect isolated
CPUs by not pinning any job on them. Not doing so could impact
the latency values in RT use-cases.


Patches
===
* Patch1:
  The first patch is meant to make cpumask_local_spread()
  aware of the isolated CPUs. It ensures that the CPUs that
  are returned by this API include only housekeeping CPUs.

* Patch2:
  This patch ensures that a probe function that is called
  using work_on_cpu() doesn't run any task on an isolated CPU.

* Patch3:
  This patch makes store_rps_map() aware of the isolated
  CPUs so that RPS doesn't queue any jobs on an isolated CPU.


Proposed Changes
===
To fix the above-mentioned issues Alex has used housekeeping_cpumask().
The only changes that I am proposing here are:
- Removing the dependency on CONFIG_TASK_ISOLATION that was proposed by
  Alex. As it should be safe to rely on housekeeping_cpuma

[PATCH v2 0/3] Preventing job distribution to isolated CPUs

2020-06-22 Thread Nitesh Narayan Lal
   
Testing
===
* Patch 1: 
  Fix for cpumask_local_spread() is tested by creating VFs, loading
  iavf module and by adding a tracepoint to confirm that only housekeeping 
  CPUs are picked when an appropriate profile is set up and all remaining  
  CPUs when no CPU isolation is configured.
   
* Patch 2: 
  To test the PCI fix, I hotplugged a virtio-net-pci from qemu console 
  and forced its addition to a specific node to trigger the code path that 
  includes the proposed fix and verified that only housekeeping CPUs   
  are included via tracepoint. 
   
* Patch 3: 
  To test the fix in store_rps_map(), I tried configuring an isolated  
  CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus which   
  resulted in 'write error: Invalid argument' error. For the case  
  where a non-isolated CPU is writing in rps_cpus the above operation  
  succeeded without any error. 
   
   
Changes from v1:   
===
- Included the suggestions made by Bjorn Helgaas in the commit messages.
- Included the 'Reviewed-by' and 'Acked-by' received for Patch-2.  
   
[1] 
https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/
   
Alex Belits (3):   
  lib: Restrict cpumask_local_spread to housekeeping CPUs
  PCI: Restrict probe functions to housekeeping CPUs   
  net: Restrict receive packets queuing to housekeeping CPUs   
   
 drivers/pci/pci-driver.c |  5 -   
 lib/cpumask.c| 43 +++-
 net/core/net-sysfs.c | 10 +-  
 3 files changed, 38 insertions(+), 20 deletions(-)
   
--



[Patch v2 3/3] net: Restrict receive packets queuing to housekeeping CPUs

2020-06-22 Thread Nitesh Narayan Lal
From: Alex Belits 

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 net/core/net-sysfs.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b822bb15..16e433287191 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
struct rps_map *old_map, *map;
cpumask_var_t mask;
-   int err, cpu, i;
+   int err, cpu, i, hk_flags;
static DEFINE_MUTEX(rps_map_mutex);
 
if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err;
}
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+   if (cpumask_weight(mask) == 0) {
+   free_cpumask_var(mask);
+   return -EINVAL;
+   }
+
map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
  GFP_KERNEL);
-- 
2.18.4



[Patch v2 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2020-06-22 Thread Nitesh Narayan Lal
From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time task,
it will still be returned to the caller for pinning of IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds a latency overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 lib/cpumask.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..cc4311a8c079 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,28 +206,34 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-   int cpu;
+   int cpu, m, n, hk_flags;
+   const struct cpumask *mask;
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   mask = housekeeping_cpumask(hk_flags);
+   m = cpumask_weight(mask);
/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   n = i % m;
+   while (m-- > 0) {
+   if (node == NUMA_NO_NODE) {
+   for_each_cpu(cpu, mask)
+   if (n-- == 0)
+   return cpu;
+   } else {
+   /* NUMA first. */
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask)
+   if (n-- == 0)
+   return cpu;
 
-   if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
-   if (i-- == 0)
-   return cpu;
-   } else {
-   /* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
-   if (i-- == 0)
-   return cpu;
+   for_each_cpu(cpu, mask) {
+   /* Skip NUMA nodes, done above. */
+   if (cpumask_test_cpu(cpu,
+cpumask_of_node(node)))
+   continue;
 
-   for_each_cpu(cpu, cpu_online_mask) {
-   /* Skip NUMA nodes, done above. */
-   if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-   continue;
-
-   if (i-- == 0)
-   return cpu;
+   if (n-- == 0)
+   return cpu;
+   }
}
}
BUG();
-- 
2.18.4



[Patch v2 2/3] PCI: Restrict probe functions to housekeeping CPUs

2020-06-22 Thread Nitesh Narayan Lal
From: Alex Belits 

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Replace the cpumask used in pci_call_probe() from all online CPUs to only
housekeeping CPUs. This is to ensure that there are no additional latency
overheads caused due to the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Acked-by: Bjorn Helgaas 
Reviewed-by: Frederic Weisbecker 
---
 drivers/pci/pci-driver.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510af1221..449466f71040 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
  const struct pci_device_id *id)
 {
int error, node, cpu;
+   int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
struct drv_dev_and_id ddi = { drv, dev, id };
 
/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
pci_physfn_is_probed(dev))
cpu = nr_cpu_ids;
else
-   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+   cpu = cpumask_any_and(cpumask_of_node(node),
+ housekeeping_cpumask(hk_flags));
 
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
-- 
2.18.4



Re: [Patch v1 2/3] PCI: prevent work_on_cpu's probe to execute on isolated CPUs

2020-06-16 Thread Nitesh Narayan Lal

On 6/16/20 4:05 PM, Bjorn Helgaas wrote:
> "git log --oneline drivers/pci/pci-driver.c" tells you that the
> subject should be something like:
>
>   PCI: Restrict probe functions to housekeeping CPUs
>
> On Wed, Jun 10, 2020 at 12:12:25PM -0400, Nitesh Narayan Lal wrote:
>> From: Alex Belits 
>>
>> pci_call_probe() prevents the nesting of work_on_cpu()
>> for a scenario where a VF device is probed from work_on_cpu()
>> of the Physical device.
>> This patch replaces the cpumask used in pci_call_probe()
>> from all online CPUs to only housekeeping CPUs. This is to
>> ensure that there are no additional latency overheads
>> caused due to the pinning of jobs on isolated CPUs.
> s/Physical/PF/ (since you used "VF" earlier, this should match that)
>
> s/This patch replaces/Replace the/
>
> Please rewrap this to fill a 75 column line (so it doesn't overflow 80
> columns when "git log" adds 4 spaces).
>
> This should be two paragraphs; add a blank line between them.

Thanks for pointing these out.
I will correct it in the next posting; before that, I will wait for any
comments on the other patches.

>
>> Signed-off-by: Alex Belits 
>> Signed-off-by: Nitesh Narayan Lal 
> Acked-by: Bjorn Helgaas 
>
>> ---
>>  drivers/pci/pci-driver.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index da6510af1221..449466f71040 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -12,6 +12,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct 
>> pci_dev *dev,
>>const struct pci_device_id *id)
>>  {
>>  int error, node, cpu;
>> +int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>>  struct drv_dev_and_id ddi = { drv, dev, id };
>>  
>>  /*
>> @@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct 
>> pci_dev *dev,
>>  pci_physfn_is_probed(dev))
>>  cpu = nr_cpu_ids;
>>  else
>> -cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
>> +cpu = cpumask_any_and(cpumask_of_node(node),
>> +  housekeeping_cpumask(hk_flags));
>>  
>>  if (cpu < nr_cpu_ids)
>>  error = work_on_cpu(cpu, local_pci_probe, &ddi);
>> -- 
>> 2.18.4
>>
-- 
Nitesh





Re: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs

2020-06-16 Thread Nitesh Narayan Lal

On 6/16/20 4:03 AM, Christoph Hellwig wrote:
> On Mon, Jun 15, 2020 at 04:21:25PM -0400, Nitesh Narayan Lal wrote:
>> +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> +mask = housekeeping_cpumask(hk_flags);
>> +cpus = cpumask_weight(mask);
> Code like this has no business inside a driver.  Please provide a
> proper core API for it instead. 

Ok, I will think of a better way of doing this.

>  Also please wire up
> pci_alloc_irq_vectors* to use this API as well.

Understood, I will include this in a separate patch.

>
-- 
Thanks
Nitesh





Re: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs

2020-06-15 Thread Nitesh Narayan Lal

On 6/15/20 4:48 PM, Keller, Jacob E wrote:
>
>> -Original Message-
>> From: Nitesh Narayan Lal 
>> Sent: Monday, June 15, 2020 1:21 PM
>> To: linux-kernel@vger.kernel.org; frede...@kernel.org; mtosa...@redhat.com;
>> sassm...@redhat.com; Kirsher, Jeffrey T ; 
>> Keller,
>> Jacob E ; jle...@redhat.com
>> Subject: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs
>>
>> In a realtime environment, it is essential to isolate
>> unwanted IRQs from isolated CPUs to prevent latency overheads.
>> Creating MSIX vectors only based on the online CPUs could lead
>> to a potential issue on an RT setup that has several isolated
>> CPUs but a very few housekeeping CPUs. This is because in these
>> kinds of setups an attempt to move the IRQs to the limited
>> housekeeping CPUs from isolated CPUs might fail due to the per
>> CPU vector limit. This could eventually result in latency spikes
>> because of the IRQ threads that we fail to move from isolated
>> CPUs. This patch prevents i40e to add vectors only based on
>> available online CPUs by using housekeeping_cpumask() to derive
>> the number of available housekeeping CPUs.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>> ---
> Ok, so the idea is that "housekeeping" CPUs are to be used for general 
> purpose configuration, and thus is a subset of online CPUs. By reducing the 
> limit to just housekeeping CPUs, we ensure that we do not overload the system 
> with more queues than can be handled by the general purpose CPUs?

Yes.
"General purpose", "housekeeping", and "non-isolated" all refer to the same
set of CPUs here.

>
> Thanks,
> Jake
>
-- 
Nitesh





[PATCH v1 0/1] limit the i40e msix vectors based on housekeeping CPUs

2020-06-15 Thread Nitesh Narayan Lal
Issue
=
With the current implementation, at the time of i40e_init_msix(), i40e
creates vectors based only on the number of online CPUs. This would
be problematic for an RT setup that includes a large number of isolated
but very few housekeeping CPUs. This is because in those setups
an attempt to move all IRQs from isolated to housekeeping CPUs may
easily fail due to the per-CPU vector limit.

Setup For The Issue
===
I have triggered this issue on a setup that had a total of 72
cores among which 68 were isolated and only 4 were left for
housekeeping tasks. I was using tuned's realtime-virtual-host profile
to configure the system. However, Tuned reported the error message
'Failed to set SMP affinity of IRQ xxx to '0040,0010,0005':
[Errno 28] No space left on the device' for several IRQs in tuned.log.
Note: There were other IRQs as well pinned to the housekeeping CPUs that
  were generated by other drivers.

Fix
===
- In this proposed fix I have replaced num_online_cpus in i40e_init_msix()
  with the number of housekeeping CPUs.
- The reason I chose to include both HK_FLAG_DOMAIN and HK_FLAG_WQ is
  that we would also need IRQ isolation with something like systemd's
  CPU affinity.


Testing
===
To test this change I had added a tracepoint in i40e_init_msix() to
find the number of CPUs derived for vector creation with and without
tuned's realtime-virtual-host profile. As per expectation with the profile
applied I was only getting the number of housekeeping CPUs and all
available CPUs without it.


Nitesh Narayan Lal (1):
  i40e: limit the msix vectors based on housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

-- 



[Patch v1] i40e: limit the msix vectors based on housekeeping CPUs

2020-06-15 Thread Nitesh Narayan Lal
In a realtime environment, it is essential to isolate
unwanted IRQs from isolated CPUs to prevent latency overheads.
Creating MSI-X vectors based only on the online CPUs could lead
to a potential issue on an RT setup that has several isolated
CPUs but very few housekeeping CPUs. This is because in these
kinds of setups an attempt to move the IRQs to the limited
housekeeping CPUs from isolated CPUs might fail due to the per-CPU
vector limit. This could eventually result in latency spikes
because of the IRQ threads that we fail to move from isolated
CPUs. This patch prevents i40e from adding vectors based only on
the available online CPUs, by using housekeeping_cpumask() to derive
the number of available housekeeping CPUs.

Signed-off-by: Nitesh Narayan Lal 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5d807c8004f8..9691bececb86 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Local includes */
 #include "i40e.h"
@@ -10933,11 +10934,13 @@ static int i40e_reserve_msix_vectors(struct i40e_pf 
*pf, int vectors)
 static int i40e_init_msix(struct i40e_pf *pf)
 {
struct i40e_hw *hw = &pf->hw;
+   const struct cpumask *mask;
int cpus, extra_vectors;
int vectors_left;
int v_budget, i;
int v_actual;
int iwarp_requested = 0;
+   int hk_flags;
 
if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
return -ENODEV;
@@ -10968,12 +10971,15 @@ static int i40e_init_msix(struct i40e_pf *pf)
 
/* reserve some vectors for the main PF traffic queues. Initially we
 * only reserve at most 50% of the available vectors, in the case that
-* the number of online CPUs is large. This ensures that we can enable
-* extra features as well. Once we've enabled the other features, we
-* will use any remaining vectors to reach as close as we can to the
-* number of online CPUs.
+* the number of online (housekeeping) CPUs is large. This ensures that
+* we can enable extra features as well. Once we've enabled the other
+* features, we will use any remaining vectors to reach as close as we
+* can to the number of online (housekeeping) CPUs.
 */
-   cpus = num_online_cpus();
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   mask = housekeeping_cpumask(hk_flags);
+   cpus = cpumask_weight(mask);
+
pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
 
-- 
2.18.4



[PATCH v1 0/3] Preventing job distribution to isolated CPUs

2020-06-10 Thread Nitesh Narayan Lal
This patch set originates from one of the patches that was
posted earlier as a part of the "Task_isolation" mode [1] patch series
by Alex Belits. There are only a couple of
changes that I am proposing in this patch set compared to what Alex
has posted earlier.


Context
===
On a broad level, all three patches included in this patch
set are meant to improve the drivers/libraries to respect isolated
CPUs by not pinning any jobs to them. Not doing so could impact
latency in RT use cases.


Patches
===
* Patch1:
  The first patch makes cpumask_local_spread()
  aware of the isolated CPUs. It ensures that the CPUs
  returned by this API include only housekeeping CPUs.

* Patch2:
  This patch ensures that a probe function that is called
  using work_on_cpu() doesn't run any task on an isolated CPU.

* Patch3:
  This patch makes store_rps_map() aware of the isolated
  CPUs so that RPS doesn't queue any jobs on an isolated CPU.


Changes
===
To fix the above-mentioned issues, Alex has used housekeeping_cpumask().
The only changes that I am proposing here are:
- Removing the dependency on CONFIG_TASK_ISOLATION that was proposed by
  Alex, as it should be safe to rely on housekeeping_cpumask() even when
  we don't have any isolated CPUs and want to fall back to using all
  available CPUs in any of the above scenarios.
- Using both HK_FLAG_DOMAIN and HK_FLAG_WQ in all three patches, because
  we would want the above fixes not only when we have isolcpus but
  also with something like systemd's CPU affinity.


Testing
===
* Patch 1:
  The fix for cpumask_local_spread() is tested by creating VFs, loading
  the iavf module, and adding a tracepoint to confirm that only
  housekeeping CPUs are picked when an appropriate isolation profile is
  set up, and that all remaining CPUs are picked when no CPU isolation
  is required/configured.

* Patch 2:
  To test the PCI fix, I hotplugged a virtio-net-pci device from the
  qemu console and forced its addition to a specific node to trigger the
  code path that includes the proposed fix, and verified via a
  tracepoint that only housekeeping CPUs are included. I understand that
  this may not be the best way to test it, so I am open to any
  suggestions to test this fix in a better way if required.

* Patch 3:
  To test the fix in store_rps_map(), I tried configuring an isolated
  CPU by writing to /sys/class/net/en*/queues/rx*/rps_cpus, which
  resulted in a 'write error: Invalid argument'. When a non-isolated
  CPU was written to rps_cpus, the same operation succeeded without
  any error.

[1] 
https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.ca...@marvell.com/

Alex Belits (3):
  lib: restricting cpumask_local_spread to only housekeeping CPUs
  PCI: prevent work_on_cpu's probe to execute on isolated CPUs
  net: restrict queuing of receive packets to housekeeping CPUs

 drivers/pci/pci-driver.c |  5 -
 lib/cpumask.c| 43 +++-
 net/core/net-sysfs.c | 10 +-
 3 files changed, 38 insertions(+), 20 deletions(-)

--  




[Patch v1 1/3] lib: restricting cpumask_local_spread to only housekeeping CPUs

2020-06-10 Thread Nitesh Narayan Lal
From: Alex Belits 

The current implementation of cpumask_local_spread() does not
respect isolated CPUs, i.e., even if a CPU has been isolated
for a real-time task, it will still be returned to the caller
for the pinning of its IRQ threads. Having these unwanted IRQ
threads on an isolated CPU adds latency overhead.
This patch restricts the CPUs that are returned for spreading
IRQs to the available housekeeping CPUs only.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 lib/cpumask.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..cc4311a8c079 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,28 +206,34 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-   int cpu;
+   int cpu, m, n, hk_flags;
+   const struct cpumask *mask;
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   mask = housekeeping_cpumask(hk_flags);
+   m = cpumask_weight(mask);
/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   n = i % m;
+   while (m-- > 0) {
+   if (node == NUMA_NO_NODE) {
+   for_each_cpu(cpu, mask)
+   if (n-- == 0)
+   return cpu;
+   } else {
+   /* NUMA first. */
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask)
+   if (n-- == 0)
+   return cpu;
 
-   if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
-   if (i-- == 0)
-   return cpu;
-   } else {
-   /* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
-   if (i-- == 0)
-   return cpu;
+   for_each_cpu(cpu, mask) {
+   /* Skip NUMA nodes, done above. */
+   if (cpumask_test_cpu(cpu,
+cpumask_of_node(node)))
+   continue;
 
-   for_each_cpu(cpu, cpu_online_mask) {
-   /* Skip NUMA nodes, done above. */
-   if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-   continue;
-
-   if (i-- == 0)
-   return cpu;
+   if (n-- == 0)
+   return cpu;
+   }
}
}
BUG();
-- 
2.18.4



[Patch v1 2/3] PCI: prevent work_on_cpu's probe to execute on isolated CPUs

2020-06-10 Thread Nitesh Narayan Lal
From: Alex Belits 

pci_call_probe() prevents the nesting of work_on_cpu()
for the scenario where a VF device is probed from the
work_on_cpu() context of its physical device.
This patch changes the cpumask used in pci_call_probe()
from all online CPUs to housekeeping CPUs only. This
ensures that no additional latency overhead is caused
by pinning probe jobs on isolated CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
 drivers/pci/pci-driver.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510af1221..449466f71040 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/isolation.h>
 #include 
 #include 
 #include 
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
  const struct pci_device_id *id)
 {
int error, node, cpu;
+   int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
struct drv_dev_and_id ddi = { drv, dev, id };
 
/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
pci_physfn_is_probed(dev))
cpu = nr_cpu_ids;
else
-   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+   cpu = cpumask_any_and(cpumask_of_node(node),
+ housekeeping_cpumask(hk_flags));
 
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
-- 
2.18.4


