RE: Affinity managed interrupts vs non-managed interrupts
> > The point I don't get here is why you need separate reply queues for
> > the interrupt coalesce setting.  Shouldn't this just be a flag at
> > submission time that indicates the amount of coalescing that should
> > happen?
> >
> > What is the benefit of having different completion queues?

By having different sets of queues (it will be something like N:16, where N
queues are without interrupt coalescing and 16 are dedicated queues for
interrupt coalescing), we want to avoid the penalty introduced by interrupt
coalescing, especially for lower QD profiles.

Kashyap
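The N:16 split implies a per-IO routing decision in the submission path. Below is a minimal userspace sketch of that decision, assuming (as mentioned elsewhere in the thread) it keys off per-device outstanding IO; the queue counts, names, and the threshold are hypothetical, not the actual megaraid_sas logic:

```c
#include <assert.h>

#define NUM_FAST_QUEUES       72  /* no coalescing, one per CPU (example) */
#define NUM_COALESCE_QUEUES   16  /* dedicated interrupt-coalescing queues */
#define COALESCE_QD_THRESHOLD  8  /* hypothetical cutoff for "high QD" */

/* Pick a reply queue index for a new request.  Low outstanding IO goes
 * to a fast (non-coalesced) queue so latency is not penalized; high
 * outstanding IO goes to a coalesced queue, amortizing interrupts. */
static int pick_reply_queue(unsigned int sdev_outstanding, unsigned int cpu)
{
	if (sdev_outstanding < COALESCE_QD_THRESHOLD)
		return (int)(cpu % NUM_FAST_QUEUES);        /* queues 0..71 */
	return (int)(NUM_FAST_QUEUES + cpu % NUM_COALESCE_QUEUES); /* 72..87 */
}
```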
Re: Affinity managed interrupts vs non-managed interrupts
On Wed, Aug 29, 2018 at 04:16:23PM +0530, Sumit Saxena wrote:
> > Could you explain a bit what the specific use case the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

The point I don't get here is why you need separate reply queues for
the interrupt coalesce setting.  Shouldn't this just be a flag at
submission time that indicates the amount of coalescing that should
happen?

What is the benefit of having different completion queues?
RE: Affinity managed interrupts vs non-managed interrupts
On Mon, 3 Sep 2018, Kashyap Desai wrote:
> I am using "for-4.19/block" and this particular patch "a0c9259
> irq/matrix: Spread interrupts on allocation" is included.

Can you please try against 4.19-rc2 or later?

> I can see that 16 extra reply queues via pre_vectors are still assigned
> to CPU 0 (effective affinity).
>
> irq 33, cpu list 0-71

The cpu list is irrelevant because that's the allowed affinity mask. The
effective one is what counts.

> # cat /sys/kernel/debug/irq/irqs/34
> node: 0
> affinity: 0-71
> effectiv: 0

So if all 16 have their effective affinity set to CPU0 then that's strange
at least.

Can you please provide the output of /sys/kernel/debug/irq/domains/VECTOR ?

> Ideally, what we are looking for 16 extra pre_vector reply queue is
> "effective affinity" to be within local numa node as long as that numa
> node has online CPUs. If not, we are ok to have effective cpu from any
> node.

Well, we surely can do the initial allocation and spreading on the local
numa node, but once all CPUs are offline on that node, then the whole
thing goes down the drain and allocates from where it sees fit. I'll think
about it some more, especially how to avoid the proliferation of the
affinity hint.

Thanks,

	tglx
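What the quoted request amounts to — effective affinity spread round-robin across the local node's online CPUs, with a fallback when the node is empty — can be sketched in plain userspace C. This is illustrative logic only, not the genirq implementation; the mask representation and names are hypothetical:

```c
#include <assert.h>

#define MAX_CPUS 72

/* Return the effective CPU for pre-vector `vec`, spreading vectors
 * round-robin over the CPUs set in `node_mask` (the local NUMA node's
 * online mask).  Returns -1 when the node has no online CPUs, in which
 * case a real allocator would fall back to any node, as tglx notes. */
static int spread_on_node(const unsigned char node_mask[MAX_CPUS],
			  unsigned int vec)
{
	unsigned int online = 0, target, seen = 0, i;

	for (i = 0; i < MAX_CPUS; i++)
		online += node_mask[i];
	if (!online)
		return -1;

	target = vec % online;          /* pick the target-th set bit */
	for (i = 0; i < MAX_CPUS; i++)
		if (node_mask[i] && seen++ == target)
			return (int)i;
	return -1;                      /* not reached */
}
```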
RE: Affinity managed interrupts vs non-managed interrupts
> > > > It is not yet finalized, but it can be based on per sdev
> > > > outstanding, shost_busy etc.
> > > > We want to use special 16 reply queue for IO acceleration (these
> > > > queues are working in interrupt coalescing mode. This is a h/w
> > > > feature)
> >
> > TBH, this does not make any sense whatsoever. Why are you trying to
> > have extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not for CPU hotplug
issues. I read your below #1 to #4 points as more about addressing CPU
hotplug concerns. Right? We also want to make sure that if we convert the
megaraid_sas driver from managed to non-managed interrupts, we can still
meet the CPU hotplug requirement. If we use "pci_enable_msix_range" and
manually set affinity in the driver using irq_set_affinity_hint, the CPU
hotplug feature works as expected: irqbalancer is able to retain the older
mapping, and whenever an offlined cpu comes back, irqbalancer restores the
same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without
any extra reply queues, we don't have any issue with cpu-msix mapping and
cpu hotplug. Our major problem with that method is that latency is very
bad at lower QD and/or in the single worker case. To solve that problem we
have added 16 extra reply queues (this is a special h/w feature for
performance only) which work in interrupt coalescing mode, while the
existing 72 reply queues work without any interrupt coalescing. The best
way to map the additional 16 reply queues is to map them to the local numa
node.

I understand that it is a unique requirement, but at the same time we may
be able to do it gracefully (in the irq subsystem), as you mentioned that
"irq_set_affinity_hint" should be avoided in low level drivers.

> 1) Allocate 72 reply queues which get nicely spread out to every CPU on
>    the system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by physical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that
>    on CPU hotplug. That means on unplug you have to wait for the reply
>    queue which is associated to the outgoing CPU to be empty and no new
>    requests to be queued, which has to be done for the regular per CPU
>    reply queues anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
>    hardware/firmware to direct the reply to the first online reply queue
>    in the group.
>
> If the last CPU of a group goes offline, then the normal hotplug
> mechanism takes effect and the whole thing is put 'offline' as well.
> This works nicely for all kind of scenarios even if you have more CPUs
> than queues. No extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in driver
> > using irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a similar mapping in the managed interrupt case, as
below?

    for (i = 0; i < 16; i++)
        irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
                              cpumask_of_node(local_numa_node));

Currently we always see that the managed interrupts for pre-vectors are
mapped to 0-71 and the effective cpu is always 0. We want some changes in
the current API which would allow us to pass flags (like *local numa
affinity*) so that the cpu-msix mapping is from the local numa node and
effective cpus are spread across the local numa node.

> Thanks,
>
>	tglx
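Point #4 of the quoted scheme — directing a 'coalescing'-flagged reply to the first online reply queue of its group — reduces to a simple lookup. A sketch under assumed names; the group size and the online-flag representation are hypothetical:

```c
#include <assert.h>

#define QUEUES_PER_GROUP 4  /* hypothetical group size, e.g. per package */

/* Find the first online reply queue in `group`.  Returns -1 when the
 * whole group is offline, at which point the normal hotplug mechanism
 * would put the group 'offline' as described above. */
static int first_online_in_group(const unsigned char queue_online[],
				 unsigned int group)
{
	unsigned int base = group * QUEUES_PER_GROUP, i;

	for (i = 0; i < QUEUES_PER_GROUP; i++)
		if (queue_online[base + i])
			return (int)(base + i);
	return -1;
}
```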
Re: Affinity managed interrupts vs non-managed interrupts
On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
>
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID
> > > adapter supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low
> IO workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

I am just wondering how you can make the decision about using the extra 16
or the regular 72 queues in the submission path; could you share a bit of
your idea? How are you going to recognize the IO workload inside your
driver? Even the current block layer doesn't recognize IO workload, such
as random IO or sequential IO.

Frankly speaking, you may reuse the 72 reply queues to do interrupt
coalescing by configuring one extra register to enable the coalescing
mode, and you may just use a small part of the 72 reply queues under the
interrupt coalescing mode.

Or you can learn from SPDK to use one or a small number of dedicated cores
or kernel threads to poll the interrupts from all reply queues, then I
guess you may benefit much compared with the extra 16 queue approach.

Introducing 16 extra queues just for interrupt coalescing and making them
coexist with the regular 72 reply queues seems one very unusual use case;
not sure the current genirq affinity can support it well.

> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > effective affinity of each vector is to CPU 0. Our requirement is to
> > > have pre_vectors 16 reply queues to be mapped to local NUMA node
> > > with effective CPU spread within local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node is offline, can this use case work as
> > expected? Seems we have to understand what the use case is and how it
> > works.
> Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be
> broken and irqbalancer takes care of migrating affected IRQs to online
> CPUs of different NUMA node.
> When offline CPUs are onlined again, irqbalancer restores affinity.

irqbalance daemon can't cover managed interrupts, or do you mean you don't
use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?

Thanks,
Ming Lei
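Ming's SPDK-style alternative — one or a few dedicated threads polling completions from all reply queues instead of taking per-queue interrupts — reduces, in its simplest form, to a budgeted drain loop. A toy model where each queue is just a counter of pending completions; all names and the budget are hypothetical:

```c
#include <assert.h>

#define NQUEUES 72

/* One polling pass over all reply queues, processing at most
 * `budget_per_queue` completions from each so no queue can starve the
 * others.  Returns the total number of completions processed. */
static unsigned long poll_all_queues(unsigned int pending[NQUEUES],
				     unsigned int budget_per_queue)
{
	unsigned long done = 0;
	unsigned int q, n;

	for (q = 0; q < NQUEUES; q++) {
		n = pending[q] < budget_per_queue ?
		    pending[q] : budget_per_queue;
		pending[q] -= n;   /* "process" up to budget completions */
		done += n;
	}
	return done;
}
```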
RE: Affinity managed interrupts vs non-managed interrupts
Hi Thomas, Ming, Chris et al.,

Your input will help us to make changes to the megaraid_sas driver. We are
currently waiting for community response.

Is it recommended to use "pci_enable_msix_range" and have the low level
driver do the affinity setting, because the current APIs around
pci_alloc_irq_vectors do not meet our requirement? We want more msix
vectors than online CPUs, and using pre_vectors we can do that, but the
first 16 msix should be mapped to the local numa node with effective cpus
spread across the cpus of the local numa node. This is not possible using
pci_alloc_irq_vectors_affinity.

Do we need kernel API changes, or should the low level driver manage it
via irq_set_affinity_hint?

Kashyap

> -----Original Message-----
> From: Sumit Saxena [mailto:sumit.sax...@broadcom.com]
> Sent: Wednesday, August 29, 2018 4:46 AM
> To: Ming Lei
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org;
> Kashyap Desai; Shivasharan Srikanteshwara
> Subject: RE: Affinity managed interrupts vs non-managed interrupts
>
> > -----Original Message-----
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
>
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID
> > > adapter supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low
> IO workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.
>
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > effective affinity of each vector is to CPU 0. Our requirement is to
> > > have pre_vectors 16 reply queues to be mapped to local NUMA node
> > > with effective CPU spread within local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node is offline, can this use case work as
> > expected? Seems we have to understand what the use case is and how it
> > works.
> Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be
> broken and irqbalancer takes care of migrating affected IRQs to online
> CPUs of different NUMA node.
> When offline CPUs are onlined again, irqbalancer restores affinity.
>
> > Thanks,
> > Ming
RE: Affinity managed interrupts vs non-managed interrupts
> -----Original Message-----
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, August 29, 2018 2:16 PM
> To: Sumit Saxena
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> Subject: Re: Affinity managed interrupts vs non-managed interrupts
>
> Hello Sumit,

Hi Ming,
Thanks for response.

> On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > Affinity managed interrupts vs non-managed interrupts
> >
> > Hi Thomas,
> >
> > We are working on next generation MegaRAID product where requirement
> > is- to allocate additional 16 MSI-x vectors in addition to number of
> > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID adapter
> > supports 128 MSI-x vectors.
> >
> > To explain the requirement and solution, consider that we have 2
> > socket system (each socket having 36 logical CPUs). Current driver
> > will allocate total 72 MSI-x vectors by calling API-
> > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x
> > vectors will have affinity across NUMA nodes and interrupts are
> > affinity managed.
> >
> > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > 16, driver can allocate 16 + 72 MSI-x vectors.
>
> Could you explain a bit what the specific use case the extra 16 vectors
> is?

We are trying to avoid the penalty due to one interrupt per IO completion
and decided to coalesce interrupts on these extra 16 reply queues.
For regular 72 reply queues, we will not coalesce interrupts, as for low
IO workload interrupt coalescing may take more time due to fewer IO
completions.
In the IO submission path, the driver will decide which set of reply
queues (either the extra 16 reply queues or the regular 72 reply queues)
to pick based on IO workload.

> > All pre_vectors (16) will be mapped to all available online CPUs but
> > effective affinity of each vector is to CPU 0. Our requirement is to
> > have pre_vectors 16 reply queues to be mapped to local NUMA node with
> > effective CPU spread within local node cpu mask. Without changing
> > kernel code, we can
>
> If all CPUs in one NUMA node is offline, can this use case work as
> expected? Seems we have to understand what the use case is and how it
> works.

Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will be
broken and the irqbalancer takes care of migrating affected IRQs to online
CPUs of a different NUMA node.
When the offline CPUs are onlined again, the irqbalancer restores
affinity.

> Thanks,
> Ming
Re: Affinity managed interrupts vs non-managed interrupts
Hello Sumit,

On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> Affinity managed interrupts vs non-managed interrupts
>
> Hi Thomas,
>
> We are working on next generation MegaRAID product where requirement is-
> to allocate additional 16 MSI-x vectors in addition to number of MSI-x
> vectors megaraid_sas driver usually allocates. MegaRAID adapter supports
> 128 MSI-x vectors.
>
> To explain the requirement and solution, consider that we have 2 socket
> system (each socket having 36 logical CPUs). Current driver will
> allocate total 72 MSI-x vectors by calling API- pci_alloc_irq_vectors
> (with flag- PCI_IRQ_AFFINITY). All 72 MSI-x vectors will have affinity
> across NUMA nodes and interrupts are affinity managed.
>
> If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = 16,
> driver can allocate 16 + 72 MSI-x vectors.

Could you explain a bit what the specific use case the extra 16 vectors
is?

> All pre_vectors (16) will be mapped to all available online CPUs but
> effective affinity of each vector is to CPU 0. Our requirement is to
> have pre_vectors 16 reply queues to be mapped to local NUMA node with
> effective CPU spread within local node cpu mask. Without changing kernel
> code, we can

If all CPUs in one NUMA node are offline, can this use case work as
expected? Seems we have to understand what the use case is and how it
works.

Thanks,
Ming
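The 16 + 72 layout produced by pre_vectors = 16 can be pictured as a simple index split: vectors 0..15 are the unspread pre-vectors (the extra reply queues) and vectors 16..87 are the affinity-spread set, 1:1 onto the 72 CPUs of the example. A hypothetical helper illustrating the split; these names are not from the driver:

```c
#include <assert.h>

#define PRE_VECTORS    16  /* extra (coalescing-capable) reply queues */
#define SPREAD_VECTORS 72  /* per-CPU queues on the 2-socket example */

enum queue_class { QC_PRE = 0, QC_SPREAD = 1 };

/* Map an MSI-x vector index to its reply-queue class. */
static enum queue_class classify_vector(unsigned int vec)
{
	return vec < PRE_VECTORS ? QC_PRE : QC_SPREAD;
}

/* For the spread set, the 1:1 mapping onto CPUs 0..71 in this example. */
static unsigned int spread_cpu_for_vector(unsigned int vec)
{
	return vec - PRE_VECTORS;
}
```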