Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-22 Thread Thomas Gleixner
On Wed, 21 Aug 2019, Keith Busch wrote:
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei  wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 
> > > --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
> > >  --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test 
> > > --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues),
> > so each queue is served by 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From the fio log,
> > CPU context switches increase a lot.
> 
> Interestingly, use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they have
> one queue per CPU on 112 cores.

Which is not surprising because the thread and the hard interrupt are on
the same CPU and there is just that little overhead of the context switch.

The thing is that this really depends on how the scheduler decides to place
the interrupt thread.

If you have a queue for several CPUs, then depending on the load situation
allowing a multi-cpu affinity for the thread can cause lots of task
migration.

But restricting the irq thread to the CPU on which the interrupt is affine
can also starve that CPU. There is no universal rule for that.

Tracing should tell.
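As an illustrative aside (my sketch, not part of the mail): before reaching for full tracing, the system-wide context-switch growth that fio reported can be sampled from the `ctxt` counter in `/proc/stat` on Linux:

```python
import time

def read_ctxt():
    """Total context switches since boot, from /proc/stat (Linux only)."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no ctxt line in /proc/stat")

def ctxt_rate(interval=0.5):
    """Context switches per second over a short sampling window."""
    before = read_ctxt()
    time.sleep(interval)
    return (read_ctxt() - before) / interval
```

Sampling this before and during a run with and without threaded interrupts gives a rough view of how much extra switching the irq threads add.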

Thanks,

tglx





Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Ming Lei
On Thu, Aug 22, 2019 at 10:00 AM Keith Busch  wrote:
>
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei  wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 
> > > --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
> > >  --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test 
> > > --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues),
> > so each queue is served by 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From the fio log,
> > CPU context switches increase a lot.
>
> Interestingly, use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they have
> one queue per CPU on 112 cores.

I haven't investigated it yet.

BTW, my fio test was done only on the single hw queue, via 'taskset -c
$cpu_list_of_the_queue', without applying the threaded interrupt affinity
patch. The NVMe is an Optane.

The same issue can be reproduced after I force 1:1 mapping by passing
'possible_cpus=32' on the kernel command line.

It may be related to kernel config options, so I have attached the one I
used; it is basically a subset of the RHEL8 kernel config.

Thanks,
Ming Lei


config.tar.gz
Description: application/gzip


Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Keith Busch
On Wed, Aug 21, 2019 at 7:34 PM Ming Lei  wrote:
> On Wed, Aug 21, 2019 at 04:27:00PM +, Long Li wrote:
> > Here is the command to benchmark it:
> >
> > fio --bs=4k --ioengine=libaio --iodepth=128 
> > --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
> >  --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test 
> > --group_reporting --gtod_reduce=1
> >
>
> I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues), so
> each queue is served by 3 CPUs.
>
> IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From the fio log,
> CPU context switches increase a lot.

Interestingly, use_threaded_interrupts shows a marginal improvement on
my machine with the same fio profile. It was only 5 NVMes, but they have
one queue per CPU on 112 cores.


Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Ming Lei
On Wed, Aug 21, 2019 at 04:27:00PM +, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On Wed, Aug 21, 2019 at 07:47:44AM +, Long Li wrote:
> >>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>> >>>
> >>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM  wrote:
> >>>> >>>>>
> >>>> >>>>> From: Long Li 
> >>>> >>>>>
> >>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>> >>>>>
> >>>> >>>>> On large systems with many CPUs, a number of CPUs may share
> >>>one
> >>>> >>>NVMe
> >>>> >>>>> hardware queue. It may have this situation where several CPUs
> >>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
> >>>> >>>>> the
> >>>> >>>hardware queue is bound to.
> >>>> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>> >>>>> interrupt mode for extended time while other CPUs continue to
> >>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
> >>>> >>>>> the system
> >>>> >>>unresponsive.
> >>>> >>>>>
> >>>> >>>>> This patch set addresses this by enforcing scheduling and
> >>>> >>>>> throttling I/O when CPU is starved in this situation.
> >>>> >>>>>
> >>>> >>>>> Long Li (3):
> >>>> >>>>>   sched: define a function to report the number of context switches
> >>>on a
> >>>> >>>>> CPU
> >>>> >>>>>   sched: export idle_cpu()
> >>>> >>>>>   nvme: complete request in work queue on CPU with flooded
> >>>> >>>>> interrupts
> >>>> >>>>>
> >>>> >>>>>  drivers/nvme/host/core.c | 57
> >>>> >>>>> +++-
> >>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
> >>>> >>>>>  include/linux/sched.h|  2 ++
> >>>> >>>>>  kernel/sched/core.c  |  7 +
> >>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
> >>>> >>>>
> >>>> >>>> Another simpler solution may be to complete request in threaded
> >>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
> >>>> >>>> the interrupt thread handler on CPUs specified by the irq
> >>>> >>>> affinity mask, which was discussed by the following link:
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>> >>>>
> >>>> >>>> Could you try the above solution and see if the lockup can be
> >>>avoided?
> >>>> >>>> John Garry
> >>>> >>>> should have workable patch.
> >>>> >>>
> >>>> >>>Yeah, so we experimented with changing the interrupt handling in
> >>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
> >>>> >>>plus patch below, and saw a significant throughput boost:
> >>>> >>>
> >>>> >>>--->8

RE: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Long Li
>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On Wed, Aug 21, 2019 at 07:47:44AM +, Long Li wrote:
>>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>> >>>
>>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM  wrote:
>>>> >>>>>
>>>> >>>>> From: Long Li 
>>>> >>>>>
>>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>> >>>>>
>>>> >>>>> On large systems with many CPUs, a number of CPUs may share
>>>one
>>>> >>>NVMe
>>>> >>>>> hardware queue. It may have this situation where several CPUs
>>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
>>>> >>>>> the
>>>> >>>hardware queue is bound to.
>>>> >>>>> This may result in that CPU swamped by interrupts and stay in
>>>> >>>>> interrupt mode for extended time while other CPUs continue to
>>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
>>>> >>>>> the system
>>>> >>>unresponsive.
>>>> >>>>>
>>>> >>>>> This patch set addresses this by enforcing scheduling and
>>>> >>>>> throttling I/O when CPU is starved in this situation.
>>>> >>>>>
>>>> >>>>> Long Li (3):
>>>> >>>>>   sched: define a function to report the number of context switches
>>>on a
>>>> >>>>> CPU
>>>> >>>>>   sched: export idle_cpu()
>>>> >>>>>   nvme: complete request in work queue on CPU with flooded
>>>> >>>>> interrupts
>>>> >>>>>
>>>> >>>>>  drivers/nvme/host/core.c | 57
>>>> >>>>> +++-
>>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
>>>> >>>>>  include/linux/sched.h|  2 ++
>>>> >>>>>  kernel/sched/core.c  |  7 +
>>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>> >>>>
>>>> >>>> Another simpler solution may be to complete request in threaded
>>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
>>>> >>>> the interrupt thread handler on CPUs specified by the irq
>>>> >>>> affinity mask, which was discussed by the following link:
>>>> >>>>
>>>> >>>>
>>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>> >>>>
>>>> >>>> Could you try the above solution and see if the lockup can be
>>>avoided?
>>>> >>>> John Garry
>>>> >>>> should have workable patch.
>>>> >>>
>>>> >>>Yeah, so we experimented with changing the interrupt handling in
>>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
>>>> >>>plus patch below, and saw a significant throughput boost:
>>>> >>>
>>>> >>>--->8
>>>> >>>
>>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
>>>> >>>irq affinity
>>>> >>>
>>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
>>>> >>>irq handler will be set to the effective affinity of the hard irq.
>>>> >>>

Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread John Garry

On 21/08/2019 10:44, Ming Lei wrote:

On Wed, Aug 21, 2019 at 07:47:44AM +, Long Li wrote:

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On 20/08/2019 09:25, Ming Lei wrote:

On Tue, Aug 20, 2019 at 2:14 PM  wrote:


From: Long Li 

This patch set tries to fix interrupt swamp in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one

NVMe

hardware queue. It may have this situation where several CPUs are
issuing I/Os, and all the I/Os are returned on the CPU where the

hardware queue is bound to.

This may result in that CPU swamped by interrupts and stay in
interrupt mode for extended time while other CPUs continue to issue
I/O. This can trigger Watchdog and RCU timeout, and make the system

unresponsive.


This patch set addresses this by enforcing scheduling and throttling
I/O when CPU is starved in this situation.

Long Li (3):
  sched: define a function to report the number of context switches on a
CPU
  sched: export idle_cpu()
  nvme: complete request in work queue on CPU with flooded interrupts

 drivers/nvme/host/core.c | 57
+++-
 drivers/nvme/host/nvme.h |  1 +
 include/linux/sched.h|  2 ++
 kernel/sched/core.c  |  7 +
 4 files changed, 66 insertions(+), 1 deletion(-)


Another simpler solution may be to complete request in threaded
interrupt handler for this case. Meantime allow scheduler to run the
interrupt thread handler on CPUs specified by the irq affinity mask,
which was discussed by the following link:



https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/

Could you try the above solution and see if the lockup can be avoided?
John Garry
should have workable patch.


Yeah, so we experimented with changing the interrupt handling in the SCSI
driver I maintain to use a threaded handler IRQ handler plus patch below,
and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single cpu. As
such, the threaded handler would always run on the same cpu as the hard irq.

We have seen scenarios in high data-rate throughput testing that the cpu
handling the interrupt can be totally saturated handling both the hard
interrupt and threaded handler parts, limiting throughput.

Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
interrupt to decide on the policy of which cpu the threaded handler may run.

Signed-off-by: John Garry 


Thanks for pointing me to this patch. It fixed the interrupt swamp and made
the system stable.

However I'm seeing reduced performance when using threaded interrupts.

Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 
hardware queues for each disk)
Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with 
queue depth = 64, num of jobs = 80, direct=1

With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS
With just interrupts and my patch: 3700k IOPS


This gap looks too big wrt. threaded interrupts vs. interrupts.



At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost 
of doing wake up and context switch for NVMe threaded IRQ handler takes some 
CPU away.



In theory it shouldn't be so, because most of the time the thread should be
running on the CPUs of this hctx, and the wakeup cost shouldn't be that big.
Maybe there is a performance problem somewhere wrt. threaded interrupts.

Could you share your test script and environment? I will see if I can
reproduce it in my environment.


In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for 
NVMe:

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index a1de501a2729..3fb30d16464e 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
 	va_list ap;
 	int ret;
 	char *devname;
-	unsigned long irqflags = IRQF_SHARED;
+	unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
 
 	if (!handler)
 		irqflags |= IRQF_ONESHOT;



I don't see why IRQF_IRQ_AFFINITY is needed.

John, could you explain it a bit why you need changes on IRQF_IRQ_AFFINITY?


Hi Ming,

The patch I shared was my original solution, based on the driver setting the
IRQF_IRQ_AFFINITY flag to request that the threaded handler use the irq
affinity mask as the handler's cpu allowed mask.


If we want to make this decision based only on whether the 

Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Ming Lei
On Wed, Aug 21, 2019 at 07:47:44AM +, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> On Tue, Aug 20, 2019 at 2:14 PM  wrote:
> >>>>>
> >>>>> From: Long Li 
> >>>>>
> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>>>
> >>>>> On large systems with many CPUs, a number of CPUs may share one
> >>>NVMe
> >>>>> hardware queue. It may have this situation where several CPUs are
> >>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
> >>>hardware queue is bound to.
> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>>> interrupt mode for extended time while other CPUs continue to issue
> >>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
> >>>unresponsive.
> >>>>>
> >>>>> This patch set addresses this by enforcing scheduling and throttling
> >>>>> I/O when CPU is starved in this situation.
> >>>>>
> >>>>> Long Li (3):
> >>>>>   sched: define a function to report the number of context switches on a
> >>>>> CPU
> >>>>>   sched: export idle_cpu()
> >>>>>   nvme: complete request in work queue on CPU with flooded interrupts
> >>>>>
> >>>>>  drivers/nvme/host/core.c | 57
> >>>>> +++-
> >>>>>  drivers/nvme/host/nvme.h |  1 +
> >>>>>  include/linux/sched.h|  2 ++
> >>>>>  kernel/sched/core.c  |  7 +
> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
> >>>>
> >>>> Another simpler solution may be to complete request in threaded
> >>>> interrupt handler for this case. Meantime allow scheduler to run the
> >>>> interrupt thread handler on CPUs specified by the irq affinity mask,
> >>>> which was discussed by the following link:
> >>>>
> >>>>
> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>>
> >>>> Could you try the above solution and see if the lockup can be avoided?
> >>>> John Garry
> >>>> should have workable patch.
> >>>
> >>>Yeah, so we experimented with changing the interrupt handling in the SCSI
> >>>driver I maintain to use a threaded handler IRQ handler plus patch below,
> >>>and saw a significant throughput boost:
> >>>
> >>>--->8
> >>>
> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq 
> >>>affinity
> >>>
> >>>Currently the cpu allowed mask for the threaded part of a threaded irq
> >>>handler will be set to the effective affinity of the hard irq.
> >>>
> >>>Typically the effective affinity of the hard irq will be for a single cpu. 
> >>>As such,
> >>>the threaded handler would always run on the same cpu as the hard irq.
> >>>
> >>>We have seen scenarios in high data-rate throughput testing that the cpu
> >>>handling the interrupt can be totally saturated handling both the hard
> >>>interrupt and threaded handler parts, limiting throughput.
> >>>
> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
> >>>interrupt to decide on the policy of which cpu the threaded handler may 
> >>>run.
> >>>
> >>>Signed-off-by: John Garry 
> 
> Thanks for pointing me to this patch. It fixed the interrupt swamp and made
> the system stable.
> 
> However I'm seeing reduced performance when using threaded interrupts.
> 
> Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 
> hardware queues for each disk)
> Benchmark tool is FIO

RE: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-21 Thread Long Li
>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> On Tue, Aug 20, 2019 at 2:14 PM  wrote:
>>>>>
>>>>> From: Long Li 
>>>>>
>>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>>>
>>>>> On large systems with many CPUs, a number of CPUs may share one
>>>NVMe
>>>>> hardware queue. It may have this situation where several CPUs are
>>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
>>>hardware queue is bound to.
>>>>> This may result in that CPU swamped by interrupts and stay in
>>>>> interrupt mode for extended time while other CPUs continue to issue
>>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
>>>unresponsive.
>>>>>
>>>>> This patch set addresses this by enforcing scheduling and throttling
>>>>> I/O when CPU is starved in this situation.
>>>>>
>>>>> Long Li (3):
>>>>>   sched: define a function to report the number of context switches on a
>>>>> CPU
>>>>>   sched: export idle_cpu()
>>>>>   nvme: complete request in work queue on CPU with flooded interrupts
>>>>>
>>>>>  drivers/nvme/host/core.c | 57
>>>>> +++-
>>>>>  drivers/nvme/host/nvme.h |  1 +
>>>>>  include/linux/sched.h|  2 ++
>>>>>  kernel/sched/core.c  |  7 +
>>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>>
>>>> Another simpler solution may be to complete request in threaded
>>>> interrupt handler for this case. Meantime allow scheduler to run the
>>>> interrupt thread handler on CPUs specified by the irq affinity mask,
>>>> which was discussed by the following link:
>>>>
>>>>
>>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>>
>>>> Could you try the above solution and see if the lockup can be avoided?
>>>> John Garry
>>>> should have workable patch.
>>>
>>>Yeah, so we experimented with changing the interrupt handling in the SCSI
>>>driver I maintain to use a threaded handler IRQ handler plus patch below,
>>>and saw a significant throughput boost:
>>>
>>>--->8
>>>
>>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
>>>
>>>Currently the cpu allowed mask for the threaded part of a threaded irq
>>>handler will be set to the effective affinity of the hard irq.
>>>
>>>Typically the effective affinity of the hard irq will be for a single cpu. 
>>>As such,
>>>the threaded handler would always run on the same cpu as the hard irq.
>>>
>>>We have seen scenarios in high data-rate throughput testing that the cpu
>>>handling the interrupt can be totally saturated handling both the hard
>>>interrupt and threaded handler parts, limiting throughput.
>>>
>>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
>>>interrupt to decide on the policy of which cpu the threaded handler may run.
>>>
>>>Signed-off-by: John Garry 

Thanks for pointing me to this patch. It fixed the interrupt swamp and made
the system stable.

However I'm seeing reduced performance when using threaded interrupts.

Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 
hardware queues for each disk)
Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with 
queue depth = 64, num of jobs = 80, direct=1

With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS
With just interrupts and my patch: 3700k IOPS
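Back-of-envelope arithmetic on these numbers (my own calculation, not from the mail): 10 disks with 32 queues each give 320 hardware queues spread over 80 CPUs, i.e. 4 queues per CPU, and at the interrupt-mode peak every CPU handles about 46.5k completions per second, so even a few microseconds of extra wakeup cost per completion is significant:

```python
disks, queues_per_disk, cpus = 10, 32, 80
iops_irq, iops_threaded = 3_720_000, 1_320_000  # figures from the mail

queues_per_cpu = disks * queues_per_disk / cpus  # hw queues served per CPU
completions_per_cpu = iops_irq / cpus            # completions/s per CPU at peak
drop = 1 - iops_threaded / iops_irq              # relative IOPS loss with threads

print(queues_per_cpu, completions_per_cpu, round(drop, 3))
# prints: 4.0 46500.0 0.645
```

A 64.5% drop is far beyond plain context-switch overhead, which supports Ming's suspicion below that something else is wrong in the threaded-interrupt path.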

At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost 
of doing wake up and context switch for NVMe threaded IRQ handler takes some 
CPU away.

In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for 
NVMe:

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index a1de501a2729..3fb30

Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-20 Thread Keith Busch
On Tue, Aug 20, 2019 at 01:59:32AM -0700, John Garry wrote:
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f179bf77..cb483a055512 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>* mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
>*/
>   if (cpumask_available(desc->irq_common_data.affinity)) {
> + struct irq_data *irq_data = &desc->irq_data;
>   const struct cpumask *m;
> 
> - m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> + if (action->flags & IRQF_IRQ_AFFINITY)
> +         m = desc->irq_common_data.affinity;
> + else
> +         m = irq_data_get_effective_affinity_mask(irq_data);
>   cpumask_copy(mask, m);
>   } else {
>   valid = false;
> -- 
> 2.17.1
> 
> As Ming mentioned in that same thread, we could even make this policy 
> for managed interrupts.

Ack, I really like this option!


Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-20 Thread John Garry

On 20/08/2019 09:25, Ming Lei wrote:

On Tue, Aug 20, 2019 at 2:14 PM  wrote:


From: Long Li 

This patch set tries to fix interrupt swamp in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one NVMe hardware
queue. It may have this situation where several CPUs are issuing I/Os, and
all the I/Os are returned on the CPU where the hardware queue is bound to.
This may result in that CPU swamped by interrupts and stay in interrupt mode
for extended time while other CPUs continue to issue I/O. This can trigger
Watchdog and RCU timeout, and make the system unresponsive.

This patch set addresses this by enforcing scheduling and throttling I/O when
CPU is starved in this situation.

Long Li (3):
  sched: define a function to report the number of context switches on a
CPU
  sched: export idle_cpu()
  nvme: complete request in work queue on CPU with flooded interrupts

 drivers/nvme/host/core.c | 57 +++-
 drivers/nvme/host/nvme.h |  1 +
 include/linux/sched.h|  2 ++
 kernel/sched/core.c  |  7 +
 4 files changed, 66 insertions(+), 1 deletion(-)


Another simpler solution may be to complete request in threaded interrupt
handler for this case. Meantime allow scheduler to run the interrupt thread
handler on CPUs specified by the irq affinity mask, which was discussed by
the following link:

https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/

Could you try the above solution and see if the lockup can be avoided?
John Garry
should have workable patch.


Yeah, so we experimented with changing the interrupt handling in the 
SCSI driver I maintain to use a threaded handler IRQ handler plus patch 
below, and saw a significant throughput boost:


--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq 
affinity


Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing that the
cpu handling the interrupt can be totally saturated handling both the
hard interrupt and threaded handler parts, limiting throughput.

Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
interrupt to decide on the policy of which cpu the threaded handler
may run.

Signed-off-by: John Garry 

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *interrupt handler after suspending interrupts. For system
  *wakeup devices users need to implement wakeup detection in
  *their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ *allowed mask for the threaded handler of a threaded interrupt
+ *handler, rather than the effective hard irq affinity.
  */
 #define IRQF_SHARED	0x0080
 #define IRQF_PROBE_SHARED	0x0100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD	0x0001
 #define IRQF_EARLY_RESUME	0x0002
 #define IRQF_COND_SUSPEND	0x0004
+#define IRQF_IRQ_AFFINITY	0x0008
 
 #define IRQF_TIMER	(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
 	 */
 	if (cpumask_available(desc->irq_common_data.affinity)) {
+		struct irq_data *irq_data = &desc->irq_data;
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (action->flags & IRQF_IRQ_AFFINITY)
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;
} else {
valid = false;
--
2.17.1

As Ming mentioned in that same thread, we could even make this policy 
for managed interrupts.
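The policy difference this patch introduces (thread pinned to the hard irq's single effective CPU vs. allowed to roam the whole irq affinity mask) has a direct user-space analogue in Linux CPU affinity control. A small sketch (my illustration only; `os.sched_getaffinity`/`os.sched_setaffinity` are real Python APIs, Linux-specific):

```python
import os

def restrict_to_one_cpu():
    """Mimic the default behaviour: pin the current task to a single CPU,
    as the threaded handler is pinned to the hard irq's effective CPU."""
    allowed = os.sched_getaffinity(0)  # current allowed-CPU set
    one_cpu = {min(allowed)}
    os.sched_setaffinity(0, one_cpu)
    return one_cpu

def widen_to_mask(mask):
    """Mimic IRQF_IRQ_AFFINITY: allow the task to run on the whole
    affinity mask, letting the scheduler spread the load."""
    os.sched_setaffinity(0, mask)
    return os.sched_getaffinity(0)
```

The trade-off Thomas raises elsewhere in the thread applies equally here: the wide mask avoids starving one CPU but invites task migration under load.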


Cheers,
John



Thanks,
Ming Lei

.






Re: [PATCH 0/3] fix interrupt swamp in NVMe

2019-08-20 Thread Ming Lei
On Tue, Aug 20, 2019 at 2:14 PM  wrote:
>
> From: Long Li 
>
> This patch set tries to fix interrupt swamp in NVMe devices.
>
> On large systems with many CPUs, a number of CPUs may share one NVMe hardware
> queue. It may have this situation where several CPUs are issuing I/Os, and
> all the I/Os are returned on the CPU where the hardware queue is bound to.
> This may result in that CPU swamped by interrupts and stay in interrupt mode
> for extended time while other CPUs continue to issue I/O. This can trigger
> Watchdog and RCU timeout, and make the system unresponsive.
>
> This patch set addresses this by enforcing scheduling and throttling I/O when
> CPU is starved in this situation.
>
> Long Li (3):
>   sched: define a function to report the number of context switches on a
> CPU
>   sched: export idle_cpu()
>   nvme: complete request in work queue on CPU with flooded interrupts
>
>  drivers/nvme/host/core.c | 57 +++-
>  drivers/nvme/host/nvme.h |  1 +
>  include/linux/sched.h|  2 ++
>  kernel/sched/core.c  |  7 +
>  4 files changed, 66 insertions(+), 1 deletion(-)

Another simpler solution may be to complete request in threaded interrupt
handler for this case. Meantime allow scheduler to run the interrupt thread
handler on CPUs specified by the irq affinity mask, which was discussed by
the following link:

https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/

Could you try the above solution and see if the lockup can be avoided?
John Garry
should have workable patch.

Thanks,
Ming Lei


[PATCH 0/3] fix interrupt swamp in NVMe

2019-08-20 Thread longli
From: Long Li 

This patch set tries to fix interrupt swamp in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one NVMe hardware
queue. A situation can arise where several CPUs are issuing I/Os, and all the
I/Os are returned on the CPU to which the hardware queue is bound. That CPU
can end up swamped by interrupts, staying in interrupt mode for an extended
time while other CPUs continue to issue I/O. This can trigger the watchdog and
RCU timeouts, and make the system unresponsive.

This patch set addresses this by enforcing scheduling and throttling I/O when
a CPU is starved in this situation.

Long Li (3):
  sched: define a function to report the number of context switches on a
CPU
  sched: export idle_cpu()
  nvme: complete request in work queue on CPU with flooded interrupts

 drivers/nvme/host/core.c | 57 +++-
 drivers/nvme/host/nvme.h |  1 +
 include/linux/sched.h|  2 ++
 kernel/sched/core.c  |  7 +
 4 files changed, 66 insertions(+), 1 deletion(-)
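The idea behind patch 3 can be sketched as follows (my hypothetical names and threshold; the real logic lives in drivers/nvme/host/core.c and uses the new scheduler helpers): complete requests inline in the interrupt handler as usual, but once the completion rate on the handling CPU looks like a flood, punt further completions to a work queue so the scheduler can run other tasks:

```python
import collections
import time

FLOOD_THRESHOLD = 1000  # completions per window before we call it a flood (made-up value)
WINDOW = 1.0            # seconds (made-up value)

class CompletionThrottle:
    """Decide whether to complete a request inline (in the irq handler)
    or defer it to a work queue, based on the recent completion rate."""
    def __init__(self):
        self.window_start = time.monotonic()
        self.count = 0

    def flooded(self):
        now = time.monotonic()
        if now - self.window_start > WINDOW:
            self.window_start, self.count = now, 0  # new sampling window
        self.count += 1
        return self.count > FLOOD_THRESHOLD

    def complete(self, request, workqueue):
        if self.flooded():
            workqueue.append(request)  # deferred: a work item completes it later
            return "deferred"
        return "inline"

wq = collections.deque()
t = CompletionThrottle()
results = [t.complete(i, wq) for i in range(5)]  # all "inline": well under the threshold
```

This is only a sketch of the throttling decision; the posted patches additionally consult the CPU's context-switch count and idle_cpu() to judge starvation.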

-- 
2.17.1