RE: [PATCH] Choose CPU based on allocated IRQs

2018-10-30 Thread Long Li
> Subject: Re: [PATCH] Choose CPU based on allocated IRQs
> 
> Long,
> 
> On Tue, 23 Oct 2018, Long Li wrote:
> 
> thanks for this patch.
> 
> A trivial formal thing ahead. The subject line
> 
>[PATCH] Choose CPU based on allocated IRQs
> 
> is lacking a proper subsystem prefix. In most cases you can figure the prefix
> out by running 'git log path/to/file' which in this case will show you that 
> most
> commits touching this file use the prefix 'genirq/matrix:'.
> 
> So the proper subject would be:
> 
>[PATCH] genirq/matrix: Choose CPU based on allocated IRQs
> 
> Subsystem prefixes are important to see where a patch belongs to right from
> the subject. Without that it could belong to any random part of the kernel
> and needs further inspection of the patch itself. This applies to both email
> and to git shortlog listings.

Thank you. I will send v2 to address this.

> 
> > From: Long Li 
> >
> > In irq_matrix, "available" is set when IRQs are allocated earlier in
> > the IRQ assigning process.
> >
> > Later, when IRQs are activated those values are not good indicators of
> > what CPU to choose to assign to this IRQ.
> 
> Can you please explain why you think that available is the wrong indicator
> and which problem you are trying to solve?
> 
> The WHY is really the most important part of a changelog.

The problem I'm seeing is that on a very large system with multiple devices of 
the same class (e.g. NVMe disks, using managed IRQs), they tend to use 
interrupts on several CPUs on the system. Under heavy load, those several CPUs 
are busy while other CPU are most idling. The issue is that when NVMe call 
irq_matrix_alloc_managed(), the assigned the CPU is always the first CPU in the 
cpumask, because they check for cpumap->available that will not change after 
managed IRQs are reserved in irq_matrix_reserve_managed (which was called from 
the 1st stage of IRQ setup in irq_domain_ops->alloc).

> 
> > Change to choose CPU for an IRQ based on how many IRQs are already
> > allocated on this CPU.
> 
> Looking deeper. The initial values are:
> 
> available = alloc_size - (managed + systembits)
> allocated = 0
> 
> There are two distinct functionalities which modify 'available' and 
> 'allocated'
> (omitting the reverse operations for simplicity):
> 
> 1) managed interrupts
> 
>reserve_managed()
>   managed++;
>   available--;
> 
>alloc_managed()
> allocated++;
> 
> 2) regular interrupts
> 
>alloc()
>   allocated++;
>   available--;
> 
> So 'available' can be lower than 'allocated' depending on the number of
> reserved managed interrupts, which have not yet been activated.
> 
> So for all regular interrupts we really want to look at the number of 
> 'available'
> vectors because the reserved managed ones are already accounted there
> and they need to be taken into account.

I think "reserved managed" may not always be accurate. Reserved managed IRQs 
may not always get activated. For an irq_data, when irq_matrix_reserve_managed 
is called, all the CPUs in the cpumask are reserved. Later, only one of them is 
activated via the call to irq_matrix_alloc_managed(). So we end up with a 
number of "reserved managed" that never get used.

> 
> For the spreading of managed interrupts in alloc_managed() that's indeed a
> different story and 'allocated' is more correct. But even that is not 
> completely
> accurate and can lead to the wrong result. The accurate solution would be to
> account the managed _and_ allocated vectors separately and do the
> spreading for managed interrupts based on that.

I think checking for "allocated" is the best approach for picking which CPU to 
assign for a given irq_data, since we really can't rely on "managed" to decide 
how busy this CPU really is. Checking for "allocated" should work for both 
unmanaged and managed IRQs.

> 
> Thanks,
> 
>   tglx


RE: [PATCH] Choose CPU based on allocated IRQs

2018-10-30 Thread Long Li
> Subject: Re: [PATCH] Choose CPU based on allocated IRQs
> 
> Long,
> 
> On Tue, 23 Oct 2018, Long Li wrote:
> 
> thanks for this patch.
> 
> A trivial formal thing ahead. The subject line
> 
>[PATCH] Choose CPU based on allocated IRQs
> 
> is lacking a proper subsystem prefix. In most cases you can figure the prefix
> out by running 'git log path/to/file' which in this case will show you that 
> most
> commits touching this file use the prefix 'genirq/matrix:'.
> 
> So the proper subject would be:
> 
>[PATCH] genirq/matrix: Choose CPU based on allocated IRQs
> 
> Subsystem prefixes are important to see where a patch belongs to right from
> the subject. Without that it could belong to any random part of the kernel
> and needs further inspection of the patch itself. This applies to both email
> and to git shortlog listings.

Thank you. I will send v2 to address this.

> 
> > From: Long Li 
> >
> > In irq_matrix, "available" is set when IRQs are allocated earlier in
> > the IRQ assigning process.
> >
> > Later, when IRQs are activated those values are not good indicators of
> > what CPU to choose to assign to this IRQ.
> 
> Can you please explain why you think that available is the wrong indicator
> and which problem you are trying to solve?
> 
> The WHY is really the most important part of a changelog.

The problem I'm seeing is that on a very large system with multiple devices of 
the same class (e.g. NVMe disks, using managed IRQs), they tend to use 
interrupts on several CPUs on the system. Under heavy load, those several CPUs 
are busy while other CPU are most idling. The issue is that when NVMe call 
irq_matrix_alloc_managed(), the assigned the CPU is always the first CPU in the 
cpumask, because they check for cpumap->available that will not change after 
managed IRQs are reserved in irq_matrix_reserve_managed (which was called from 
the 1st stage of IRQ setup in irq_domain_ops->alloc).

> 
> > Change to choose CPU for an IRQ based on how many IRQs are already
> > allocated on this CPU.
> 
> Looking deeper. The initial values are:
> 
> available = alloc_size - (managed + systembits)
> allocated = 0
> 
> There are two distinct functionalities which modify 'available' and 
> 'allocated'
> (omitting the reverse operations for simplicity):
> 
> 1) managed interrupts
> 
>reserve_managed()
>   managed++;
>   available--;
> 
>alloc_managed()
> allocated++;
> 
> 2) regular interrupts
> 
>alloc()
>   allocated++;
>   available--;
> 
> So 'available' can be lower than 'allocated' depending on the number of
> reserved managed interrupts, which have not yet been activated.
> 
> So for all regular interrupts we really want to look at the number of 
> 'available'
> vectors because the reserved managed ones are already accounted there
> and they need to be taken into account.

I think "reserved managed" may not always be accurate. Reserved managed IRQs 
may not always get activated. For an irq_data, when irq_matrix_reserve_managed 
is called, all the CPUs in the cpumask are reserved. Later, only one of them is 
activated via the call to irq_matrix_alloc_managed(). So we end up with a 
number of "reserved managed" that never get used.

> 
> For the spreading of managed interrupts in alloc_managed() that's indeed a
> different story and 'allocated' is more correct. But even that is not 
> completely
> accurate and can lead to the wrong result. The accurate solution would be to
> account the managed _and_ allocated vectors separately and do the
> spreading for managed interrupts based on that.

I think checking for "allocated" is the best approach for picking which CPU to 
assign for a given irq_data, since we really can't rely on "managed" to decide 
how busy this CPU really is. Checking for "allocated" should work for both 
unmanaged and managed IRQs.

> 
> Thanks,
> 
>   tglx


Re: [PATCH] Choose CPU based on allocated IRQs

2018-10-29 Thread Thomas Gleixner
Long,

On Tue, 23 Oct 2018, Long Li wrote:

thanks for this patch.

A trivial formal thing ahead. The subject line

   [PATCH] Choose CPU based on allocated IRQs

is lacking a proper subsystem prefix. In most cases you can figure the
prefix out by running 'git log path/to/file' which in this case will show
you that most commits touching this file use the prefix 'genirq/matrix:'.

So the proper subject would be:

   [PATCH] genirq/matrix: Choose CPU based on allocated IRQs

Subsystem prefixes are important to see where a patch belongs to right from
the subject. Without that it could belong to any random part of the kernel
and needs further inspection of the patch itself. This applies to both
email and to git shortlog listings.

> From: Long Li 
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.

Can you please explain why you think that available is the wrong indicator
and which problem you are trying to solve?

The WHY is really the most important part of a changelog.

> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.

Looking deeper. The initial values are:

available = alloc_size - (managed + systembits)
allocated = 0

There are two distinct functionalities which modify 'available' and
'allocated' (omitting the reverse operations for simplicity):

1) managed interrupts

   reserve_managed()
managed++;
available--;

   alloc_managed()
allocated++;

2) regular interrupts

   alloc()
allocated++;
available--;

So 'available' can be lower than 'allocated' depending on the number of
reserved managed interrupts, which have not yet been activated.

So for all regular interrupts we really want to look at the number of
'available' vectors because the reserved managed ones are already accounted
there and they need to be taken into account.

For the spreading of managed interrupts in alloc_managed() that's indeed a
different story and 'allocated' is more correct. But even that is not
completely accurate and can lead to the wrong result. The accurate solution
would be to account the managed _and_ allocated vectors separately and do
the spreading for managed interrupts based on that.

Thanks,

tglx


Re: [PATCH] Choose CPU based on allocated IRQs

2018-10-29 Thread Thomas Gleixner
Long,

On Tue, 23 Oct 2018, Long Li wrote:

thanks for this patch.

A trivial formal thing ahead. The subject line

   [PATCH] Choose CPU based on allocated IRQs

is lacking a proper subsystem prefix. In most cases you can figure the
prefix out by running 'git log path/to/file' which in this case will show
you that most commits touching this file use the prefix 'genirq/matrix:'.

So the proper subject would be:

   [PATCH] genirq/matrix: Choose CPU based on allocated IRQs

Subsystem prefixes are important to see where a patch belongs to right from
the subject. Without that it could belong to any random part of the kernel
and needs further inspection of the patch itself. This applies to both
email and to git shortlog listings.

> From: Long Li 
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.

Can you please explain why you think that available is the wrong indicator
and which problem you are trying to solve?

The WHY is really the most important part of a changelog.

> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.

Looking deeper. The initial values are:

available = alloc_size - (managed + systembits)
allocated = 0

There are two distinct functionalities which modify 'available' and
'allocated' (omitting the reverse operations for simplicity):

1) managed interrupts

   reserve_managed()
managed++;
available--;

   alloc_managed()
allocated++;

2) regular interrupts

   alloc()
allocated++;
available--;

So 'available' can be lower than 'allocated' depending on the number of
reserved managed interrupts, which have not yet been activated.

So for all regular interrupts we really want to look at the number of
'available' vectors because the reserved managed ones are already accounted
there and they need to be taken into account.

For the spreading of managed interrupts in alloc_managed() that's indeed a
different story and 'allocated' is more correct. But even that is not
completely accurate and can lead to the wrong result. The accurate solution
would be to account the managed _and_ allocated vectors separately and do
the spreading for managed interrupts based on that.

Thanks,

tglx


RE: [PATCH] Choose CPU based on allocated IRQs

2018-10-29 Thread Michael Kelley
From: Long Li   Sent: Monday, October 22, 2018 6:41 PM
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.
> 
> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley 


RE: [PATCH] Choose CPU based on allocated IRQs

2018-10-29 Thread Michael Kelley
From: Long Li   Sent: Monday, October 22, 2018 6:41 PM
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.
> 
> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley