>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe >>> >>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote: >>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe >>>> >>> >>>> >>>On 20/08/2019 09:25, Ming Lei wrote: >>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <lon...@linuxonhyperv.com> wrote: >>>> >>>>> >>>> >>>>> From: Long Li <lon...@microsoft.com> >>>> >>>>> >>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices. >>>> >>>>> >>>> >>>>> On large systems with many CPUs, a number of CPUs may share >>>one >>>> >>>NVMe >>>> >>>>> hardware queue. It may have this situation where several CPUs >>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where >>>> >>>>> the >>>> >>>hardware queue is bound to. >>>> >>>>> This may result in that CPU swamped by interrupts and stay in >>>> >>>>> interrupt mode for extended time while other CPUs continue to >>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make >>>> >>>>> the system >>>> >>>unresponsive. >>>> >>>>> >>>> >>>>> This patch set addresses this by enforcing scheduling and >>>> >>>>> throttling I/O when CPU is starved in this situation. >>>> >>>>> >>>> >>>>> Long Li (3): >>>> >>>>> sched: define a function to report the number of context switches >>>on a >>>> >>>>> CPU >>>> >>>>> sched: export idle_cpu() >>>> >>>>> nvme: complete request in work queue on CPU with flooded >>>> >>>>> interrupts >>>> >>>>> >>>> >>>>> drivers/nvme/host/core.c | 57 >>>> >>>>> +++++++++++++++++++++++++++++++++++++++- >>>> >>>>> drivers/nvme/host/nvme.h | 1 + >>>> >>>>> include/linux/sched.h | 2 ++ >>>> >>>>> kernel/sched/core.c | 7 +++++ >>>> >>>>> 4 files changed, 66 insertions(+), 1 deletion(-) >>>> >>>> >>>> >>>> Another simpler solution may be to complete request in threaded >>>> >>>> interrupt handler for this case. Meantime allow scheduler to run >>>> >>>> the interrupt thread handler on CPUs specified by the irq >>>> >>>> affinity mask, which was discussed by the following link: >>>> >>>> >>>> >>>> >>>> >>>https://lor >>>> >>>e >>>> >>>> .kernel.org%2Flkml%2Fe0e9478e-62a5-ca24-3b12- >>>> >>>58f7d056383e%40huawei.com >>>> >>>> %2F&data=02%7C01%7Clongli%40microsoft.com%7Cc7f46d3e2 >>>73f45 >>>> >>>176d1c08 >>>> >>>> >>>> >>>d7254cc69e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63 >>>70188 >>>> >>>8401588 >>>> >>>> >>>> >>>9866&sdata=h5k6HoGoyDxuhmDfuKLZUwgmw17PU%2BT%2FCb >>>awfxV >>>> >>>Er3U%3D& >>>> >>>> reserved=0 >>>> >>>> >>>> >>>> Could you try the above solution and see if the lockup can be >>>avoided? >>>> >>>> John Garry >>>> >>>> should have workable patch. >>>> >>> >>>> >>>Yeah, so we experimented with changing the interrupt handling in >>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler >>>> >>>plus patch below, and saw a significant throughput boost: >>>> >>> >>>> >>>--->8 >>>> >>> >>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard >>>> >>>irq affinity >>>> >>> >>>> >>>Currently the cpu allowed mask for the threaded part of a threaded >>>> >>>irq handler will be set to the effective affinity of the hard irq. >>>> >>> >>>> >>>Typically the effective affinity of the hard irq will be for a >>>> >>>single cpu. As such, the threaded handler would always run on the >>>same cpu as the hard irq. >>>> >>> >>>> >>>We have seen scenarios in high data-rate throughput testing that >>>> >>>the cpu handling the interrupt can be totally saturated handling >>>> >>>both the hard interrupt and threaded handler parts, limiting >>>throughput. >>>> >>> >>>> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the >>>> >>>threaded interrupt to decide on the policy of which cpu the threaded >>>handler may run. >>>> >>> >>>> >>>Signed-off-by: John Garry <john.ga...@huawei.com> >>>> >>>> Thanks for pointing me to this patch. This fixed the interrupt swamp and >>>make the system stable. >>>> >>>> However I'm seeing reduced performance when using threaded >>>interrupts. >>>> >>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks >>>> (32 hardware queues for each disk) Benchmark tool is FIO, I/O pattern: >>>> 4k random reads on all NVMe disks, with queue depth = 64, num of jobs >>>> = 80, direct=1 >>>> >>>> With threaded interrupts: 1320k IOPS >>>> With just interrupts: 3720k IOPS >>>> With just interrupts and my patch: 3700k IOPS >>> >>>This gap looks too big wrt. threaded interrupts vs. interrupts. >>> >>>> >>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the >>>cost of doing wake up and context switch for NVMe threaded IRQ handler >>>takes some CPU away. >>>> >>> >>>In theory, it shouldn't be so because most of times the thread should be >>>running on CPUs of this hctx, and the wakeup cost shouldn't be so big. >>>Maybe there is performance problem somewhere wrt. threaded interrupt. >>> >>>Could you share us your test script and environment? I will see if I can >>>reproduce it in my environment.
Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size. Here is the command to benchmark it: fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1 Thanks Long >>> >>>> In this test, I made the following change to make use of >>>IRQF_IRQ_AFFINITY for NVMe: >>>> >>>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c index >>>> a1de501a2729..3fb30d16464e 100644 >>>> --- a/drivers/pci/irq.c >>>> +++ b/drivers/pci/irq.c >>>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int >>>nr, irq_handler_t handler, >>>> va_list ap; >>>> int ret; >>>> char *devname; >>>> - unsigned long irqflags = IRQF_SHARED; >>>> + unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY; >>>> >>>> if (!handler) >>>> irqflags |= IRQF_ONESHOT; >>>> >>> >>>I don't see why IRQF_IRQ_AFFINITY is needed. >>> >>>John, could you explain it a bit why you need changes on >>>IRQF_IRQ_AFFINITY? >>> >>>The following patch should be enough: >>> >>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index >>>e8f7f179bf77..1e7cffc1c20c 100644 >>>--- a/kernel/irq/manage.c >>>+++ b/kernel/irq/manage.c >>>@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, >>>struct irqaction *action) >>> if (cpumask_available(desc->irq_common_data.affinity)) { >>> const struct cpumask *m; >>> >>>- m = irq_data_get_effective_affinity_mask(&desc- >>>>irq_data); >>>+ if (irqd_affinity_is_managed(&desc->irq_data)) >>>+ m = desc->irq_common_data.affinity; >>>+ else >>>+ m = irq_data_get_effective_affinity_mask( >>>+ &desc->irq_data); >>> cpumask_copy(mask, m); >>> } else { >>> valid = false; >>> >>> >>>Thanks, >>>Ming