RE: Affinity managed interrupts vs non-managed interrupts

2018-09-11 Thread Kashyap Desai
>
> The point I don't get here is why you need separate reply queues for
> the interrupt coalesce setting.  Shouldn't this just be a flag at
> submission time that indicates the amount of coalescing that should
> happen?
>
> What is the benefit of having different completion queues?

By having a different set of queues (it would be something like N:16, where
N queues run without interrupt coalescing and 16 dedicated queues run with
interrupt coalescing) we want to avoid the penalty introduced by interrupt
coalescing, especially for lower QD profiles.
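Roughly, the submission-path selection we have in mind would look like the
sketch below (the structure, fields and threshold are illustrative only,
not actual megaraid_sas code):

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Illustrative only: pick a reply queue based on how busy the device is.
 * Low outstanding IO  -> latency sensitive -> use a non-coalesced queue.
 * High outstanding IO -> throughput bound  -> use a coalesced queue.
 */
#define COALESCE_QD_THRESHOLD   8       /* made-up cut-over point */

struct hba_instance {
        atomic_t fw_outstanding;        /* IOs currently with the firmware */
        u16 nr_regular_queues;          /* the N non-coalesced queues */
        u16 first_coalesced_queue;      /* index of the first of the 16 */
        u16 nr_coalesced_queues;        /* 16 */
};

static u16 pick_reply_queue(struct hba_instance *hba, unsigned int cpu)
{
        if (atomic_read(&hba->fw_outstanding) < COALESCE_QD_THRESHOLD)
                return cpu % hba->nr_regular_queues;

        return hba->first_coalesced_queue + (cpu % hba->nr_coalesced_queues);
}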

Kashyap


Re: Affinity managed interrupts vs non-managed interrupts

2018-09-11 Thread Christoph Hellwig
On Wed, Aug 29, 2018 at 04:16:23PM +0530, Sumit Saxena wrote:
> > Could you explain a bit what the specific use case the extra 16 vectors
> > is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

The point I don't get here is why you need separate reply queues for
the interrupt coalesce setting.  Shouldn't this just be a flag at
submission time that indicates the amount of coalescing that should
happen?

What is the benefit of having different completion queues?
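For concreteness, the flag-based alternative could look something like the
fragment below; the descriptor layout and flag name are made up for
illustration and are not taken from any real firmware interface:

#include <linux/types.h>

/* Hypothetical submission descriptor with a per-request coalescing hint. */
#define MY_REQ_FLAG_COALESCE    (1U << 0)

struct my_req_desc {
        u32 flags;
        /* ... rest of the submission descriptor ... */
};

/* At high queue depth, ask the hardware/firmware to coalesce the reply. */
static void mark_for_coalescing(struct my_req_desc *desc, bool device_busy)
{
        if (device_busy)
                desc->flags |= MY_REQ_FLAG_COALESCE;
}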


RE: Affinity managed interrupts vs non-managed interrupts

2018-09-03 Thread Thomas Gleixner
On Mon, 3 Sep 2018, Kashyap Desai wrote:
> I am using " for-4.19/block " and this particular patch "a0c9259
> irq/matrix: Spread interrupts on allocation" is included.

Can you please try against 4.19-rc2 or later?

> I can see that 16 extra reply queues via pre_vectors are still assigned
> to CPU 0 (effective affinity).
> 
> irq 33, cpu list 0-71

The cpu list is irrelevant because that's the allowed affinity mask. The
effective one is what counts.

> # cat /sys/kernel/debug/irq/irqs/34
> node: 0
> affinity: 0-71
> effectiv: 0

So if all 16 have their effective affinity set to CPU0 then that's strange
at least.

Can you please provide the output of /sys/kernel/debug/irq/domains/VECTOR ?

> Ideally, what we are looking for 16 extra pre_vector reply queue is
> "effective affinity" to be within local numa node as long as that numa
> node has online CPUs. If not, we are ok to have effective cpu from any
> node.

Well, we surely can do the initial allocation and spreading on the local
numa node, but once all CPUs are offline on that node, then the whole thing
goes down the drain and allocates from where it sees fit. I'll think about
it some more, especially how to avoid the proliferation of the affinity
hint.

Thanks,

tglx


RE: Affinity managed interrupts vs non-managed interrupts

2018-08-31 Thread Kashyap Desai
> >
> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use special 16 reply queue for IO acceleration (these queues
> > are working in interrupt coalescing mode. This is a h/w feature)
>
> TBH, this does not make any sense whatsoever. Why are you trying to have
> extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not to address CPU
hotplug issues.
I read your points #1 to #4 below as mostly addressing CPU hotplug. Right?
We also want to make sure that if we convert the megaraid_sas driver from
managed to non-managed interrupts, we can still meet the CPU hotplug
requirement.  If we use "pci_enable_msix_range" and manually set affinity
in the driver using irq_set_affinity_hint, CPU hotplug works as expected:
the older mapping is retained, and whenever an offlined CPU comes back,
irqbalance restores the same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without
any extra reply queues, we don't have any issue with CPU-MSI-x mapping or
CPU hotplug.
Our major problem with that method is that latency is very bad at lower QD
and/or in the single worker case.

To solve that problem we have added 16 extra reply queues (a special h/w
feature for performance only) which can work in interrupt coalescing mode,
while the existing 72 reply queues work without any interrupt coalescing.
The best way to map the additional 16 reply queues is to map them to the
local NUMA node.

I understand that it is a unique requirement, but at the same time we may
be able to do it gracefully (in the irq subsystem), since as you mentioned,
"irq_set_affinity_hint" should be avoided in low level drivers.



>
> 1) Allocate 72 reply queues which get nicely spread out to every CPU on
>    the system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by physical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that
>    on CPU hotplug. That means on unplug you have to wait for the reply
>    queue which is associated to the outgoing CPU to be empty and no new
>    requests to be queued, which has to be done for the regular per CPU
>    reply queues anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
>    hardware/firmware to direct the reply to the first online reply queue
>    in the group.
>
> If the last CPU of a group goes offline, then the normal hotplug
> mechanism takes effect and the whole thing is put 'offline' as well. This
> works nicely for all kinds of scenarios even if you have more CPUs than
> queues. No extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in the driver
> > using irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.
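For illustration, the grouping outlined in #2-#4 above could look roughly
like the sketch below (hypothetical structures and names, not genirq or
megaraid_sas code):

#include <linux/bitops.h>
#include <linux/types.h>

/* One group of reply queues, e.g. per physical package. */
struct reply_queue_group {
        unsigned long online;   /* bitmap of online queues in this group */
        u16 first_qid;          /* hw id of the first queue in the group */
        u16 nr_queues;
};

/* 'Coalescing' requests go to the first online reply queue of the group. */
static u16 pick_coalescing_queue(struct reply_queue_group *grp)
{
        unsigned int i = find_first_bit(&grp->online, grp->nr_queues);

        /* Whole group offline: the hotplug code has put it offline too,
         * so fall back to queue 0. */
        return i < grp->nr_queues ? grp->first_qid + i : 0;
}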

Is it possible to have a similar mapping in the managed interrupt case, as
below?

for (i = 0; i < 16; i++)
        irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
                              cpumask_of_node(local_numa_node));

Currently we always see that the managed interrupts for the pre-vectors
have affinity 0-71 and the effective CPU is always 0.
We want some changes in the current API which would allow us to pass a
flag (something like *local numa affinity*) so that the CPU-MSI-x mapping
comes from the local NUMA node and the effective CPUs are spread across
the local NUMA node.
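Something like the following is the mapping we are after (a sketch only;
with managed interrupts the affinity hint below is not honored, which is
exactly the limitation we are describing):

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/topology.h>

struct my_instance {
        struct pci_dev *pdev;
        /* ... */
};

/* Spread the 16 pre_vector hints one CPU at a time across the local node. */
static void spread_pre_vectors_on_node(struct my_instance *instance,
                                       int local_numa_node)
{
        const struct cpumask *node_mask = cpumask_of_node(local_numa_node);
        int cpu = cpumask_first(node_mask);
        int i;

        for (i = 0; i < 16; i++) {
                irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
                                      cpumask_of(cpu));
                cpu = cpumask_next(cpu, node_mask);
                if (cpu >= nr_cpu_ids)  /* wrap around within the node */
                        cpu = cpumask_first(node_mask);
        }
}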

>
> Thanks,
>
>   tglx


Re: Affinity managed interrupts vs non-managed interrupts

2018-08-31 Thread Ming Lei
On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena  wrote:
>
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena 
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
> >
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > >  Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > > supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16 and, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case the extra 16 vectors
> > is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

I am just wondering how you can make the decision between the extra 16
queues and the regular 72 queues in the submission path; could you share a
bit of your idea? How are you going to recognize the IO workload inside
your driver? Even the current block layer doesn't recognize IO workload,
such as random IO vs. sequential IO.

Frankly speaking, you may reuse the 72 reply queues to do interrupt
coalescing by configuring one extra register to enable the coalescing
mode, and you may just use a small part of the 72 reply queues in
interrupt coalescing mode.

Or you can learn from SPDK and use one or a small number of dedicated
cores or kernel threads to poll completions from all reply queues; then I
guess you may benefit a lot compared with the extra-16-queue approach.
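A rough sketch of that polling idea, assuming a driver-side
megasas_drain_reply_queue() helper that does the completion processing
(the helper is a placeholder here, not an existing function):

#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/sched.h>

struct hba_instance;                    /* driver instance, details omitted */

/* Placeholder: reap all pending completions of one reply queue and return
 * how many were found.  A real driver would walk the reply descriptors
 * here instead of taking one interrupt per completion. */
static int megasas_drain_reply_queue(struct hba_instance *hba, int q)
{
        return 0;
}

static int reply_poll_thread(void *data)
{
        struct hba_instance *hba = data;
        const int nr_queues = 72 + 16;

        while (!kthread_should_stop()) {
                int q, found = 0;

                for (q = 0; q < nr_queues; q++)
                        found += megasas_drain_reply_queue(hba, q);

                if (!found)
                        usleep_range(5, 10);    /* idle: back off briefly */
                cond_resched();
        }
        return 0;
}

/* Started once at init time, e.g.:
 *      kthread_run(reply_poll_thread, hba, "mr_poll");
 */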

Introducing 16 extra queues just for interrupt coalescing and making them
coexist with the regular 72 reply queues seems like a very unusual use
case; I am not sure the current genirq affinity code can support it well.

> >
> > >
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > effective affinity of each vector is to CPU 0. Our requirement is to
> > > have pre_vectors 16 reply queues to be mapped to local NUMA node with
> > > effective CPU should be spread within local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node is offline, can this use case work as
> > expected?
> > Seems we have to understand what the use case is and how it works.
>
> Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be
> broken and irqbalancer takes care of migrating affected IRQs to online
> CPUs of different NUMA node.
> When offline CPUs are onlined again, irqbalancer restores affinity.

The irqbalance daemon can't cover managed interrupts, or do you mean that
you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?

Thanks,
Ming Lei


RE: Affinity managed interrupts vs non-managed interrupts

2018-08-30 Thread Kashyap Desai
Hi Thomas, Ming, Chris, et al.,

Your input will help us make changes to the megaraid_sas driver.  We are
currently waiting for a community response.

Is it recommended to use "pci_enable_msix_range" and have the low level
driver do the affinity setting, given that the current APIs around
pci_alloc_irq_vectors do not meet our requirement?

We want more MSI-x vectors than online CPUs, and using pre_vectors we can
do that, but the first 16 MSI-x vectors should be mapped to the local NUMA
node, with the effective CPUs spread across the CPUs of the local NUMA
node. This is not possible using pci_alloc_irq_vectors_affinity.
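For reference, the managed-affinity allocation under discussion is roughly
the following (a sketch with assumed names; as noted, the 16 pre_vectors
then end up with affinity 0-71 and effective CPU 0 instead of a local NUMA
node spread):

#include <linux/interrupt.h>
#include <linux/pci.h>

static int alloc_reply_queue_vectors(struct pci_dev *pdev, int nr_cpu_queues)
{
        /* The first 16 vectors are excluded from affinity spreading. */
        struct irq_affinity desc = { .pre_vectors = 16 };

        /* Returns the number of vectors allocated or a negative errno. */
        return pci_alloc_irq_vectors_affinity(pdev, 16 + 1,
                                              16 + nr_cpu_queues,
                                              PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                              &desc);
}

The spread part comes out fine; it is only the placement of the 16
pre_vectors that we cannot influence.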

Do we need kernel API changes, or should the low level driver manage it
via irq_set_affinity_hint?

Kashyap

> -Original Message-
> From: Sumit Saxena [mailto:sumit.sax...@broadcom.com]
> Sent: Wednesday, August 29, 2018 4:46 AM
> To: Ming Lei
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org; Kashyap
> Desai; Shivasharan Srikanteshwara
> Subject: RE: Affinity managed interrupts vs non-managed interrupts
>
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena 
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
> >
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > >  Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > > supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16 and, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case the extra 16 vectors
> > is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.
> >
> > >
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > effective affinity of each vector is to CPU 0. Our requirement is to
> > > have pre_vectors 16 reply queues to be mapped to local NUMA node with
> > > effective CPU should be spread within local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node is offline, can this use case work as
> > expected?
> > Seems we have to understand what the use case is and how it works.
>
> Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be
> broken and irqbalancer takes care of migrating affected IRQs to online
> CPUs of different NUMA node.
> When offline CPUs are onlined again, irqbalancer restores affinity.
> >
> >
> > Thanks,
> > Ming


RE: Affinity managed interrupts vs non-managed interrupts

2018-08-29 Thread Sumit Saxena
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, August 29, 2018 2:16 PM
> To: Sumit Saxena 
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> Subject: Re: Affinity managed interrupts vs non-managed interrupts
>
> Hello Sumit,
Hi Ming,
Thanks for response.
>
> On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> >  Affinity managed interrupts vs non-managed interrupts
> >
> > Hi Thomas,
> >
> > We are working on next generation MegaRAID product where requirement
> > is- to allocate additional 16 MSI-x vectors in addition to number of
> > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > supports 128 MSI-x vectors.
> >
> > To explain the requirement and solution, consider that we have 2
> > socket system (each socket having 36 logical CPUs). Current driver
> > will allocate total 72 MSI-x vectors by calling API-
> > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > vectors will have affinity across NUMA nodes and interrupts are
> > affinity managed.
> >
> > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > 16 and, driver can allocate 16 + 72 MSI-x vectors.
>
> Could you explain a bit what the specific use case the extra 16 vectors
> is?
We are trying to avoid the penalty of one interrupt per IO completion, and
decided to coalesce interrupts on these extra 16 reply queues.
For the regular 72 reply queues, we will not coalesce interrupts, because
for a low-IO workload interrupt coalescing may add latency due to fewer IO
completions.
In the IO submission path, the driver will decide which set of reply
queues (either the extra 16 reply queues or the regular 72 reply queues)
to pick, based on the IO workload.
>
> >
> > All pre_vectors (16) will be mapped to all available online CPUs but
> > effective affinity of each vector is to CPU 0. Our requirement is to
> > have pre_vectors 16 reply queues to be mapped to local NUMA node with
> > effective CPU should be spread within local node cpu mask. Without
> > changing kernel code, we can
>
> If all CPUs in one NUMA node is offline, can this use case work as
> expected?
> Seems we have to understand what the use case is and how it works.

Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU affinity will
be broken and irqbalance takes care of migrating the affected IRQs to
online CPUs of a different NUMA node.
When the offline CPUs are onlined again, irqbalance restores the affinity.
>
>
> Thanks,
> Ming


Re: Affinity managed interrupts vs non-managed interrupts

2018-08-29 Thread Ming Lei
Hello Sumit,

On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
>  Affinity managed interrupts vs non-managed interrupts
> 
> Hi Thomas,
> 
> We are working on next generation MegaRAID product where requirement is- to
> allocate additional 16 MSI-x vectors in addition to number of MSI-x vectors
> megaraid_sas driver usually allocates.  MegaRAID adapter supports 128 MSI-x
> vectors.
> 
> To explain the requirement and solution, consider that we have 2 socket
> system (each socket having 36 logical CPUs). Current driver will allocate
> total 72 MSI-x vectors by calling API- pci_alloc_irq_vectors(with flag-
> PCI_IRQ_AFFINITY).  All 72 MSI-x vectors will have affinity across NUMA
> nodes and interrupts are affinity managed.
> 
> If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = 16
> and, driver can allocate 16 + 72 MSI-x vectors.

Could you explain a bit what the specific use case for the extra 16
vectors is?

> 
> All pre_vectors (16) will be mapped to all available online CPUs but
> effective affinity of each vector is to CPU 0. Our requirement is to have
> pre_vectors 16 reply queues to be mapped to local NUMA node with effective
> CPU should be spread within local node cpu mask. Without changing kernel
> code, we can

If all CPUs in one NUMA node are offline, can this use case work as
expected? It seems we have to understand what the use case is and how it
works.


Thanks,
Ming

