Re: Observing Softlockup's while running heavy IOs

2016-09-12 Thread Neil Horman
On Mon, Sep 12, 2016 at 01:48:39PM +0530, Sreekanth Reddy wrote:
> On Thu, Sep 8, 2016 at 7:09 PM, Neil Horman  wrote:
> > On Thu, Sep 08, 2016 at 11:12:40AM +0530, Sreekanth Reddy wrote:
> >> On Wed, Sep 7, 2016 at 6:54 PM, Neil Horman  wrote:
> >> > On Wed, Sep 07, 2016 at 11:30:04AM +0530, Sreekanth Reddy wrote:
> >> >> On Tue, Sep 6, 2016 at 8:36 PM, Neil Horman  
> >> >> wrote:
> >> >> > On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
> >> >> >> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
> >> >> >>  wrote:
> >> >> >> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> >> >> >> >>
> >> >> >> >> I reduced the ISR workload by one third in-order to reduce the 
> >> >> >> >> time
> >> >> >> >> that is spent per CPU in interrupt context, even then I am 
> >> >> >> >> observing
> >> >> >> >> softlockups.
> >> >> >> >>
> >> >> >> >> As I mentioned before only same single CPU in the set of 
> >> >> >> >> CPUs(enabled
> >> >> >> >> in affinity_hint) is busy with handling the interrupts from
> >> >> >> >> corresponding IRQx. I have done below experiment in driver to 
> >> >> >> >> limit
> >> >> >> >> these softlockups/hardlockups. But I am not sure whether it is
> >> >> >> >> reasonable to do this in driver,
> >> >> >> >>
> >> >> >> >> Experiment:
> >> >> >> >> If the CPUx is continuously busy with handling the remote CPUs
> >> >> >> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 
> >> >> >> >> 1/4th
> >> >> >> >> of the HBA queue depth in the same ISR context then enable a flag
> >> >> >> >> called 'change_smp_affinity' for this IRQ. Also created a thread 
> >> >> >> >> with
> >> >> >> >> will poll for this flag for every IRQ's (enabled by driver) for 
> >> >> >> >> every
> >> >> >> >> second. If this thread see that this flag is enabled for any IRQ 
> >> >> >> >> then
> >> >> >> >> it will write next CPU number from the CPUs enabled in the IRQ's
> >> >> >> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
> >> >> >> >> 'call_usermodehelper()' API.
> >> >> >> >>
> >> >> >> >> This to make sure that interrupts are not processed by same 
> >> >> >> >> single CPU
> >> >> >> >> all the time and to make the other CPUs to handle the interrupts 
> >> >> >> >> if
> >> >> >> >> the current CPU is continuously busy with handling the other CPUs 
> >> >> >> >> IO
> >> >> >> >> interrupts.
> >> >> >> >>
> >> >> >> >> For example consider a system which has 8 logical CPUs and one 
> >> >> >> >> MSIx
> >> >> >> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
> >> >> >> >> then IRQ's procfs attributes will be
> >> >> >> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
> >> >> >> >>
> >> >> >> >> After starting heavy IOs, we will observe that only CPU0 will be 
> >> >> >> >> busy
> >> >> >> >> with handling the interrupts. This experiment driver will change 
> >> >> >> >> the
> >> >> >> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
> >> >> >> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
> >> >> >> >> call_usermodehelper() API) if it observes that CPU0 is 
> >> >> >> >> continuously
> >> >> >> >> processing more than 2K of IOs replies of other CPUs i.e from 
> >> >> >> >> CPU1 to
> >> >> >> >> CPU7.
> >> >> >> >>
> >> >> >> >> Whether doing this kind of stuff in driver is ok?
> >> >> >> >
> >> >> >> >
> >> >> >> > Hello Sreekanth,
> >> >> >> >
> >> >> >> > To me this sounds like something that should be implemented in the 
> >> >> >> > I/O
> >> >> >> > chipset on the motherboard. If you have a look at the Intel 
> >> >> >> > Software
> >> >> >> > Developer Manuals then you will see that logical destination mode 
> >> >> >> > supports
> >> >> >> > round-robin interrupt delivery. However, the Linux kernel selects 
> >> >> >> > physical
> >> >> >> > destination mode on systems with more than eight logical CPUs (see 
> >> >> >> > also
> >> >> >> > arch/x86/kernel/apic/apic_flat_64.c).
> >> >> >> >
> >> >> >> > I'm not sure the maintainers of the interrupt subsystem would 
> >> >> >> > welcome code
> >> >> >> > that emulates round-robin interrupt delivery. So your best option 
> >> >> >> > is
> >> >> >> > probably to minimize the amount of work that is done in interrupt 
> >> >> >> > context
> >> >> >> > and to move as much work as possible out of interrupt context in 
> >> >> >> > such a way
> >> >> >> > that it can be spread over multiple CPU cores, e.g. by using
> >> >> >> > queue_work_on().
> >> >> >> >
> >> >> >> > Bart.
> >> >> >>
> >> >> >> Bart,
> >> >> >>
> >> >> >> Thanks a lot for providing lot of inputs and valuable information on 
> >> >> >> this issue.
> >> >> >>
> >> >> >> Today I got one more observation. i.e. I am not observing any lockups
> >> >> >> if I use 1.0.4-6 versioned irqbalance.
> >> >> >> Since this versioned irqbalance is able to shift the load to other 
> >> >> >> CPU
> >> >> >> when one CPU is heavily loaded.
> >> >> >>
> >> >> >
> >> >> > This isn't happening because irqbalance is no longer able to shift load
> >> >> > between cpus, it's happening because of commit
> >> >> > 996ee2cf7a4d10454de68ac4978adb5cf22850f8.

Re: Observing Softlockup's while running heavy IOs

2016-09-12 Thread Sreekanth Reddy
On Thu, Sep 8, 2016 at 7:09 PM, Neil Horman  wrote:
> On Thu, Sep 08, 2016 at 11:12:40AM +0530, Sreekanth Reddy wrote:
>> On Wed, Sep 7, 2016 at 6:54 PM, Neil Horman  wrote:
>> > On Wed, Sep 07, 2016 at 11:30:04AM +0530, Sreekanth Reddy wrote:
>> >> On Tue, Sep 6, 2016 at 8:36 PM, Neil Horman  wrote:
>> >> > On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
>> >> >> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
>> >> >>  wrote:
>> >> >> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
>> >> >> >>
>> >> >> >> I reduced the ISR workload by one third in-order to reduce the time
>> >> >> >> that is spent per CPU in interrupt context, even then I am observing
>> >> >> >> softlockups.
>> >> >> >>
>> >> >> >> As I mentioned before only same single CPU in the set of 
>> >> >> >> CPUs(enabled
>> >> >> >> in affinity_hint) is busy with handling the interrupts from
>> >> >> >> corresponding IRQx. I have done below experiment in driver to limit
>> >> >> >> these softlockups/hardlockups. But I am not sure whether it is
>> >> >> >> reasonable to do this in driver,
>> >> >> >>
>> >> >> >> Experiment:
>> >> >> >> If the CPUx is continuously busy with handling the remote CPUs
>> >> >> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
>> >> >> >> of the HBA queue depth in the same ISR context then enable a flag
>> >> >> >> called 'change_smp_affinity' for this IRQ. Also created a thread 
>> >> >> >> with
>> >> >> >> will poll for this flag for every IRQ's (enabled by driver) for 
>> >> >> >> every
>> >> >> >> second. If this thread see that this flag is enabled for any IRQ 
>> >> >> >> then
>> >> >> >> it will write next CPU number from the CPUs enabled in the IRQ's
>> >> >> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
>> >> >> >> 'call_usermodehelper()' API.
>> >> >> >>
>> >> >> >> This to make sure that interrupts are not processed by same single 
>> >> >> >> CPU
>> >> >> >> all the time and to make the other CPUs to handle the interrupts if
>> >> >> >> the current CPU is continuously busy with handling the other CPUs IO
>> >> >> >> interrupts.
>> >> >> >>
>> >> >> >> For example consider a system which has 8 logical CPUs and one MSIx
>> >> >> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
>> >> >> >> then IRQ's procfs attributes will be
>> >> >> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
>> >> >> >>
>> >> >> >> After starting heavy IOs, we will observe that only CPU0 will be 
>> >> >> >> busy
>> >> >> >> with handling the interrupts. This experiment driver will change the
>> >> >> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
>> >> >> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
>> >> >> >> call_usermodehelper() API) if it observes that CPU0 is continuously
>> >> >> >> processing more than 2K of IOs replies of other CPUs i.e from CPU1 
>> >> >> >> to
>> >> >> >> CPU7.
>> >> >> >>
>> >> >> >> Whether doing this kind of stuff in driver is ok?
>> >> >> >
>> >> >> >
>> >> >> > Hello Sreekanth,
>> >> >> >
>> >> >> > To me this sounds like something that should be implemented in the 
>> >> >> > I/O
>> >> >> > chipset on the motherboard. If you have a look at the Intel Software
>> >> >> > Developer Manuals then you will see that logical destination mode 
>> >> >> > supports
>> >> >> > round-robin interrupt delivery. However, the Linux kernel selects 
>> >> >> > physical
>> >> >> > destination mode on systems with more than eight logical CPUs (see 
>> >> >> > also
>> >> >> > arch/x86/kernel/apic/apic_flat_64.c).
>> >> >> >
>> >> >> > I'm not sure the maintainers of the interrupt subsystem would 
>> >> >> > welcome code
>> >> >> > that emulates round-robin interrupt delivery. So your best option is
>> >> >> > probably to minimize the amount of work that is done in interrupt 
>> >> >> > context
>> >> >> > and to move as much work as possible out of interrupt context in 
>> >> >> > such a way
>> >> >> > that it can be spread over multiple CPU cores, e.g. by using
>> >> >> > queue_work_on().
>> >> >> >
>> >> >> > Bart.
>> >> >>
>> >> >> Bart,
>> >> >>
>> >> >> Thanks a lot for providing lot of inputs and valuable information on 
>> >> >> this issue.
>> >> >>
>> >> >> Today I got one more observation. i.e. I am not observing any lockups
>> >> >> if I use 1.0.4-6 versioned irqbalance.
>> >> >> Since this versioned irqbalance is able to shift the load to other CPU
>> >> >> when one CPU is heavily loaded.
>> >> >>
>> >> >
>> >> > This isn't happening because irqbalance is no longer able to shift load 
>> >> > between
>> >> > cpus, its happening because of commit 
>> >> > 996ee2cf7a4d10454de68ac4978adb5cf22850f8.
> >> >> > irqs with higher interrupt volumes should be balanced to a specific cpu 
>> >> > core,
>> >> > rather than to a cache domain to maximize cpu-local cache hit rates.  
>> >> > Prior to
>> >> > that change we balanced to a cache domain and your workload didn't have 
>> >> > to
>> >> > serialize multiple interrupts to a single core.

Re: Observing Softlockup's while running heavy IOs

2016-09-08 Thread Neil Horman
On Thu, Sep 08, 2016 at 11:12:40AM +0530, Sreekanth Reddy wrote:
> On Wed, Sep 7, 2016 at 6:54 PM, Neil Horman  wrote:
> > On Wed, Sep 07, 2016 at 11:30:04AM +0530, Sreekanth Reddy wrote:
> >> On Tue, Sep 6, 2016 at 8:36 PM, Neil Horman  wrote:
> >> > On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
> >> >> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
> >> >>  wrote:
> >> >> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> >> >> >>
> >> >> >> I reduced the ISR workload by one third in-order to reduce the time
> >> >> >> that is spent per CPU in interrupt context, even then I am observing
> >> >> >> softlockups.
> >> >> >>
> >> >> >> As I mentioned before only same single CPU in the set of CPUs(enabled
> >> >> >> in affinity_hint) is busy with handling the interrupts from
> >> >> >> corresponding IRQx. I have done below experiment in driver to limit
> >> >> >> these softlockups/hardlockups. But I am not sure whether it is
> >> >> >> reasonable to do this in driver,
> >> >> >>
> >> >> >> Experiment:
> >> >> >> If the CPUx is continuously busy with handling the remote CPUs
> >> >> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
> >> >> >> of the HBA queue depth in the same ISR context then enable a flag
> >> >> >> called 'change_smp_affinity' for this IRQ. Also created a thread with
> >> >> >> will poll for this flag for every IRQ's (enabled by driver) for every
> >> >> >> second. If this thread see that this flag is enabled for any IRQ then
> >> >> >> it will write next CPU number from the CPUs enabled in the IRQ's
> >> >> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
> >> >> >> 'call_usermodehelper()' API.
> >> >> >>
> >> >> >> This to make sure that interrupts are not processed by same single 
> >> >> >> CPU
> >> >> >> all the time and to make the other CPUs to handle the interrupts if
> >> >> >> the current CPU is continuously busy with handling the other CPUs IO
> >> >> >> interrupts.
> >> >> >>
> >> >> >> For example consider a system which has 8 logical CPUs and one MSIx
> >> >> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
> >> >> >> then IRQ's procfs attributes will be
> >> >> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
> >> >> >>
> >> >> >> After starting heavy IOs, we will observe that only CPU0 will be busy
> >> >> >> with handling the interrupts. This experiment driver will change the
> >> >> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
> >> >> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
> >> >> >> call_usermodehelper() API) if it observes that CPU0 is continuously
> >> >> >> processing more than 2K of IOs replies of other CPUs i.e from CPU1 to
> >> >> >> CPU7.
> >> >> >>
> >> >> >> Whether doing this kind of stuff in driver is ok?
> >> >> >
> >> >> >
> >> >> > Hello Sreekanth,
> >> >> >
> >> >> > To me this sounds like something that should be implemented in the I/O
> >> >> > chipset on the motherboard. If you have a look at the Intel Software
> >> >> > Developer Manuals then you will see that logical destination mode 
> >> >> > supports
> >> >> > round-robin interrupt delivery. However, the Linux kernel selects 
> >> >> > physical
> >> >> > destination mode on systems with more than eight logical CPUs (see 
> >> >> > also
> >> >> > arch/x86/kernel/apic/apic_flat_64.c).
> >> >> >
> >> >> > I'm not sure the maintainers of the interrupt subsystem would welcome 
> >> >> > code
> >> >> > that emulates round-robin interrupt delivery. So your best option is
> >> >> > probably to minimize the amount of work that is done in interrupt 
> >> >> > context
> >> >> > and to move as much work as possible out of interrupt context in such 
> >> >> > a way
> >> >> > that it can be spread over multiple CPU cores, e.g. by using
> >> >> > queue_work_on().
> >> >> >
> >> >> > Bart.
> >> >>
> >> >> Bart,
> >> >>
> >> >> Thanks a lot for providing lot of inputs and valuable information on 
> >> >> this issue.
> >> >>
> >> >> Today I got one more observation. i.e. I am not observing any lockups
> >> >> if I use 1.0.4-6 versioned irqbalance.
> >> >> Since this versioned irqbalance is able to shift the load to other CPU
> >> >> when one CPU is heavily loaded.
> >> >>
> >> >
> >> > This isn't happening because irqbalance is no longer able to shift load 
> >> > between
> >> > cpus, its happening because of commit 
> >> > 996ee2cf7a4d10454de68ac4978adb5cf22850f8.
> >> > irqs with higher interrupt volumes should be balanced to a specific cpu 
> >> > core,
> >> > rather than to a cache domain to maximize cpu-local cache hit rates.  
> >> > Prior to
> >> > that change we balanced to a cache domain and your workload didn't have 
> >> > to
> >> > serialize multiple interrupts to a single core.  My suggestion to you is 
> >> > to use
> >> > the --policyscript option to make your storage irqs get balanced to the 
> >> > cache
> >> > level, rather than the core level.  That should return the behavior to
> >> > what you want.

Re: Observing Softlockup's while running heavy IOs

2016-09-07 Thread Sreekanth Reddy
On Wed, Sep 7, 2016 at 6:54 PM, Neil Horman  wrote:
> On Wed, Sep 07, 2016 at 11:30:04AM +0530, Sreekanth Reddy wrote:
>> On Tue, Sep 6, 2016 at 8:36 PM, Neil Horman  wrote:
>> > On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
>> >> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
>> >>  wrote:
>> >> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
>> >> >>
>> >> >> I reduced the ISR workload by one third in-order to reduce the time
>> >> >> that is spent per CPU in interrupt context, even then I am observing
>> >> >> softlockups.
>> >> >>
>> >> >> As I mentioned before only same single CPU in the set of CPUs(enabled
>> >> >> in affinity_hint) is busy with handling the interrupts from
>> >> >> corresponding IRQx. I have done below experiment in driver to limit
>> >> >> these softlockups/hardlockups. But I am not sure whether it is
>> >> >> reasonable to do this in driver,
>> >> >>
>> >> >> Experiment:
>> >> >> If the CPUx is continuously busy with handling the remote CPUs
>> >> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
>> >> >> of the HBA queue depth in the same ISR context then enable a flag
>> >> >> called 'change_smp_affinity' for this IRQ. Also created a thread with
>> >> >> will poll for this flag for every IRQ's (enabled by driver) for every
>> >> >> second. If this thread see that this flag is enabled for any IRQ then
>> >> >> it will write next CPU number from the CPUs enabled in the IRQ's
>> >> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
>> >> >> 'call_usermodehelper()' API.
>> >> >>
>> >> >> This to make sure that interrupts are not processed by same single CPU
>> >> >> all the time and to make the other CPUs to handle the interrupts if
>> >> >> the current CPU is continuously busy with handling the other CPUs IO
>> >> >> interrupts.
>> >> >>
>> >> >> For example consider a system which has 8 logical CPUs and one MSIx
>> >> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
>> >> >> then IRQ's procfs attributes will be
>> >> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
>> >> >>
>> >> >> After starting heavy IOs, we will observe that only CPU0 will be busy
>> >> >> with handling the interrupts. This experiment driver will change the
>> >> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
>> >> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
>> >> >> call_usermodehelper() API) if it observes that CPU0 is continuously
>> >> >> processing more than 2K of IOs replies of other CPUs i.e from CPU1 to
>> >> >> CPU7.
>> >> >>
>> >> >> Whether doing this kind of stuff in driver is ok?
>> >> >
>> >> >
>> >> > Hello Sreekanth,
>> >> >
>> >> > To me this sounds like something that should be implemented in the I/O
>> >> > chipset on the motherboard. If you have a look at the Intel Software
>> >> > Developer Manuals then you will see that logical destination mode 
>> >> > supports
>> >> > round-robin interrupt delivery. However, the Linux kernel selects 
>> >> > physical
>> >> > destination mode on systems with more than eight logical CPUs (see also
>> >> > arch/x86/kernel/apic/apic_flat_64.c).
>> >> >
>> >> > I'm not sure the maintainers of the interrupt subsystem would welcome 
>> >> > code
>> >> > that emulates round-robin interrupt delivery. So your best option is
>> >> > probably to minimize the amount of work that is done in interrupt 
>> >> > context
>> >> > and to move as much work as possible out of interrupt context in such a 
>> >> > way
>> >> > that it can be spread over multiple CPU cores, e.g. by using
>> >> > queue_work_on().
>> >> >
>> >> > Bart.
>> >>
>> >> Bart,
>> >>
>> >> Thanks a lot for providing lot of inputs and valuable information on this 
>> >> issue.
>> >>
>> >> Today I got one more observation. i.e. I am not observing any lockups
>> >> if I use 1.0.4-6 versioned irqbalance.
>> >> Since this versioned irqbalance is able to shift the load to other CPU
>> >> when one CPU is heavily loaded.
>> >>
>> >
>> > This isn't happening because irqbalance is no longer able to shift load 
>> > between
>> > cpus, its happening because of commit 
>> > 996ee2cf7a4d10454de68ac4978adb5cf22850f8.
>> > irqs with higher interrupt volumes should be balanced to a specific cpu 
>> > core,
>> > rather than to a cache domain to maximize cpu-local cache hit rates.  
>> > Prior to
>> > that change we balanced to a cache domain and your workload didn't have to
>> > serialize multiple interrupts to a single core.  My suggestion to you is 
>> > to use
>> > the --policyscript option to make your storage irqs get balanced to the 
>> > cache
>> > level, rather than the core level.  That should return the behavior to 
>> > what you
>> > want.
>> >
>> > Neil
>>
>> Hi Neil,
>>
>> Thanks for reply.
>>
>> Today I tried setting balance_level to 'cache' for the mpt3sas driver
>> IRQs using the below policy script, with irqbalance version 1.0.9,
>> --
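[The policy script itself is cut off in this archive. Purely as an
illustration of the approach under discussion -- not the original script --
an irqbalance --policyscript hook that requests cache-level balancing for
mpt3sas IRQs could look roughly like the sketch below. It assumes irqbalance
invokes the script once per discovered IRQ with the sysfs device path and
IRQ number as arguments, and reads key=value pairs from the script's stdout.]

#!/bin/bash
# Illustrative sketch only; not the script referenced in this mail.
# irqbalance --policyscript runs this once per discovered IRQ:
#   $1 = sysfs device path of the IRQ's parent device
#   $2 = IRQ number
# Key=value pairs printed on stdout guide how irqbalance treats the IRQ.

IRQNUM=$2

# Treat any IRQ that appears on an mpt3sas line in /proc/interrupts as a
# storage IRQ and ask irqbalance to balance it at the cache-domain level.
if grep -E "^ *${IRQNUM}:.*mpt3sas" /proc/interrupts > /dev/null; then
    echo "balance_level=cache"
fi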

Re: Observing Softlockup's while running heavy IOs

2016-09-07 Thread Neil Horman
On Wed, Sep 07, 2016 at 11:30:04AM +0530, Sreekanth Reddy wrote:
> On Tue, Sep 6, 2016 at 8:36 PM, Neil Horman  wrote:
> > On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
> >> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
> >>  wrote:
> >> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> >> >>
> >> >> I reduced the ISR workload by one third in-order to reduce the time
> >> >> that is spent per CPU in interrupt context, even then I am observing
> >> >> softlockups.
> >> >>
> >> >> As I mentioned before only same single CPU in the set of CPUs(enabled
> >> >> in affinity_hint) is busy with handling the interrupts from
> >> >> corresponding IRQx. I have done below experiment in driver to limit
> >> >> these softlockups/hardlockups. But I am not sure whether it is
> >> >> reasonable to do this in driver,
> >> >>
> >> >> Experiment:
> >> >> If the CPUx is continuously busy with handling the remote CPUs
> >> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
> >> >> of the HBA queue depth in the same ISR context then enable a flag
> >> >> called 'change_smp_affinity' for this IRQ. Also created a thread with
> >> >> will poll for this flag for every IRQ's (enabled by driver) for every
> >> >> second. If this thread see that this flag is enabled for any IRQ then
> >> >> it will write next CPU number from the CPUs enabled in the IRQ's
> >> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
> >> >> 'call_usermodehelper()' API.
> >> >>
> >> >> This to make sure that interrupts are not processed by same single CPU
> >> >> all the time and to make the other CPUs to handle the interrupts if
> >> >> the current CPU is continuously busy with handling the other CPUs IO
> >> >> interrupts.
> >> >>
> >> >> For example consider a system which has 8 logical CPUs and one MSIx
> >> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
> >> >> then IRQ's procfs attributes will be
> >> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
> >> >>
> >> >> After starting heavy IOs, we will observe that only CPU0 will be busy
> >> >> with handling the interrupts. This experiment driver will change the
> >> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
> >> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
> >> >> call_usermodehelper() API) if it observes that CPU0 is continuously
> >> >> processing more than 2K of IOs replies of other CPUs i.e from CPU1 to
> >> >> CPU7.
> >> >>
> >> >> Whether doing this kind of stuff in driver is ok?
> >> >
> >> >
> >> > Hello Sreekanth,
> >> >
> >> > To me this sounds like something that should be implemented in the I/O
> >> > chipset on the motherboard. If you have a look at the Intel Software
> >> > Developer Manuals then you will see that logical destination mode 
> >> > supports
> >> > round-robin interrupt delivery. However, the Linux kernel selects 
> >> > physical
> >> > destination mode on systems with more than eight logical CPUs (see also
> >> > arch/x86/kernel/apic/apic_flat_64.c).
> >> >
> >> > I'm not sure the maintainers of the interrupt subsystem would welcome 
> >> > code
> >> > that emulates round-robin interrupt delivery. So your best option is
> >> > probably to minimize the amount of work that is done in interrupt context
> >> > and to move as much work as possible out of interrupt context in such a 
> >> > way
> >> > that it can be spread over multiple CPU cores, e.g. by using
> >> > queue_work_on().
> >> >
> >> > Bart.
> >>
> >> Bart,
> >>
> >> Thanks a lot for providing lot of inputs and valuable information on this 
> >> issue.
> >>
> >> Today I got one more observation. i.e. I am not observing any lockups
> >> if I use 1.0.4-6 versioned irqbalance.
> >> Since this versioned irqbalance is able to shift the load to other CPU
> >> when one CPU is heavily loaded.
> >>
> >
> > This isn't happening because irqbalance is no longer able to shift load 
> > between
> > cpus, its happening because of commit 
> > 996ee2cf7a4d10454de68ac4978adb5cf22850f8.
> > irqs with higher interrupt volumes should be balanced to a specific cpu core,
> > rather than to a cache domain to maximize cpu-local cache hit rates.  Prior 
> > to
> > that change we balanced to a cache domain and your workload didn't have to
> > serialize multiple interrupts to a single core.  My suggestion to you is to 
> > use
> > the --policyscript option to make your storage irqs get balanced to the 
> > cache
> > level, rather than the core level.  That should return the behavior to what 
> > you
> > want.
> >
> > Neil
> 
> Hi Neil,
> 
> Thanks for reply.
> 
> Today I tried with setting balance_level to 'cache' for mpt3sas driver
> IRQ's using below policy script and used 1.0.9 versioned irqbalance,
> --
> #!/bin/bash
> # Header
> # Linux Shell Scripting for Irq Balance Policy select for mpt3sas driver
> #
> 
> # Command Line Args
> 

Re: Observing Softlockup's while running heavy IOs

2016-09-06 Thread Neil Horman
On Tue, Sep 06, 2016 at 04:52:37PM +0530, Sreekanth Reddy wrote:
> On Fri, Sep 2, 2016 at 4:34 AM, Bart Van Assche
>  wrote:
> > On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> >>
> >> I reduced the ISR workload by one third in-order to reduce the time
> >> that is spent per CPU in interrupt context, even then I am observing
> >> softlockups.
> >>
> >> As I mentioned before only same single CPU in the set of CPUs(enabled
> >> in affinity_hint) is busy with handling the interrupts from
> >> corresponding IRQx. I have done below experiment in driver to limit
> >> these softlockups/hardlockups. But I am not sure whether it is
> >> reasonable to do this in driver,
> >>
> >> Experiment:
> >> If the CPUx is continuously busy with handling the remote CPUs
> >> (enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
> >> of the HBA queue depth in the same ISR context then enable a flag
> >> called 'change_smp_affinity' for this IRQ. Also created a thread with
> >> will poll for this flag for every IRQ's (enabled by driver) for every
> >> second. If this thread see that this flag is enabled for any IRQ then
> >> it will write next CPU number from the CPUs enabled in the IRQ's
> >> affinity_hint to the IRQ's smp_affinity procfs attribute using
> >> 'call_usermodehelper()' API.
> >>
> >> This to make sure that interrupts are not processed by same single CPU
> >> all the time and to make the other CPUs to handle the interrupts if
> >> the current CPU is continuously busy with handling the other CPUs IO
> >> interrupts.
> >>
> >> For example consider a system which has 8 logical CPUs and one MSIx
> >> vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
> >> then IRQ's procfs attributes will be
> >> IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
> >>
> >> After starting heavy IOs, we will observe that only CPU0 will be busy
> >> with handling the interrupts. This experiment driver will change the
> >> smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
> >> /proc/irq/120/smp_affinity', driver issue's this cmd using
> >> call_usermodehelper() API) if it observes that CPU0 is continuously
> >> processing more than 2K of IOs replies of other CPUs i.e from CPU1 to
> >> CPU7.
> >>
> >> Whether doing this kind of stuff in driver is ok?
> >
> >
> > Hello Sreekanth,
> >
> > To me this sounds like something that should be implemented in the I/O
> > chipset on the motherboard. If you have a look at the Intel Software
> > Developer Manuals then you will see that logical destination mode supports
> > round-robin interrupt delivery. However, the Linux kernel selects physical
> > destination mode on systems with more than eight logical CPUs (see also
> > arch/x86/kernel/apic/apic_flat_64.c).
> >
> > I'm not sure the maintainers of the interrupt subsystem would welcome code
> > that emulates round-robin interrupt delivery. So your best option is
> > probably to minimize the amount of work that is done in interrupt context
> > and to move as much work as possible out of interrupt context in such a way
> > that it can be spread over multiple CPU cores, e.g. by using
> > queue_work_on().
> >
> > Bart.
> 
> Bart,
> 
> Thanks a lot for providing lot of inputs and valuable information on this 
> issue.
> 
> Today I got one more observation, i.e. I am not observing any lockups
> if I use irqbalance version 1.0.4-6, since that version of irqbalance is
> able to shift the load to other CPUs when one CPU is heavily loaded.
> 

This isn't happening because irqbalance is no longer able to shift load between
cpus, it's happening because of commit 996ee2cf7a4d10454de68ac4978adb5cf22850f8:
IRQs with higher interrupt volumes should be balanced to a specific cpu core,
rather than to a cache domain, to maximize cpu-local cache hit rates.  Prior to
that change we balanced to a cache domain and your workload didn't have to
serialize multiple interrupts to a single core.  My suggestion to you is to use
the --policyscript option to make your storage irqs get balanced to the cache
level, rather than the core level.  That should return the behavior to what you
want.

Neil

> While running heavy IOs, for the first few seconds, here are my driver IRQ
> attributes:
> 
> ioc number = 0
> number of core processors = 24
> msix vector count = 2
> number of cores per msix vector = 16
> 
> 
> msix index = 0, irq number =  50, smp_affinity = 40, affinity_hint = 000fff
> msix index = 1, irq number =  51, smp_affinity = 001000, affinity_hint = fff000
> 
> We have set affinity for 2 msix vectors and 24 core processors
> --
> 
> After a few seconds it observed that CPU12 was heavily loaded for IRQ 51,
> and it changed the smp_affinity to CPU21
> 

Re: Observing Softlockup's while running heavy IOs

2016-09-01 Thread Bart Van Assche

On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:

I reduced the ISR workload by one third in-order to reduce the time
that is spent per CPU in interrupt context, even then I am observing
softlockups.

As I mentioned before only same single CPU in the set of CPUs(enabled
in affinity_hint) is busy with handling the interrupts from
corresponding IRQx. I have done below experiment in driver to limit
these softlockups/hardlockups. But I am not sure whether it is
reasonable to do this in driver,

Experiment:
If the CPUx is continuously busy with handling the remote CPUs
(enabled in the corresponding IRQ's affinity_hint) IO works by 1/4th
of the HBA queue depth in the same ISR context then enable a flag
called 'change_smp_affinity' for this IRQ. Also created a thread with
will poll for this flag for every IRQ's (enabled by driver) for every
second. If this thread see that this flag is enabled for any IRQ then
it will write next CPU number from the CPUs enabled in the IRQ's
affinity_hint to the IRQ's smp_affinity procfs attribute using
'call_usermodehelper()' API.

This to make sure that interrupts are not processed by same single CPU
all the time and to make the other CPUs to handle the interrupts if
the current CPU is continuously busy with handling the other CPUs IO
interrupts.

For example consider a system which has 8 logical CPUs and one MSIx
vector enabled (called IRQ 120) in driver, HBA queue depth as 8K.
then IRQ's procfs attributes will be
IRQ# 120, affinity_hint=0xff, smp_affinity=0x00

After starting heavy IOs, we will observe that only CPU0 will be busy
with handling the interrupts. This experiment driver will change the
smp_affinity to next CPU number i.e. 0x01 (using cmd 'echo 0x01 >
/proc/irq/120/smp_affinity', driver issue's this cmd using
call_usermodehelper() API) if it observes that CPU0 is continuously
processing more than 2K of IOs replies of other CPUs i.e from CPU1 to
CPU7.

Whether doing this kind of stuff in driver is ok?


Hello Sreekanth,

To me this sounds like something that should be implemented in the I/O 
chipset on the motherboard. If you have a look at the Intel Software 
Developer Manuals then you will see that logical destination mode 
supports round-robin interrupt delivery. However, the Linux kernel 
selects physical destination mode on systems with more than eight 
logical CPUs (see also arch/x86/kernel/apic/apic_flat_64.c).


I'm not sure the maintainers of the interrupt subsystem would welcome 
code that emulates round-robin interrupt delivery. So your best option 
is probably to minimize the amount of work that is done in interrupt 
context and to move as much work as possible out of interrupt context in 
such a way that it can be spread over multiple CPU cores, e.g. by using 
queue_work_on().
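
As a rough illustration of that suggestion (hypothetical names, not mpt3sas
code), an ISR could hand the heavy completion processing off to the CPU that
submitted the IO with queue_work_on() along these lines:

/* Hypothetical sketch: defer per-reply processing from the ISR to the
 * CPU that submitted the IO, using queue_work_on(). */
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_reply_work {
        struct work_struct work;
        void *reply;                    /* completed reply descriptor */
};

static void my_reply_worker(struct work_struct *work)
{
        struct my_reply_work *rw = container_of(work, struct my_reply_work, work);

        /* heavy completion handling runs here, in process context */
        kfree(rw);
}

static irqreturn_t my_isr(int irq, void *dev_id)
{
        int submit_cpu = 0;     /* a real driver records the submitting CPU per request */
        struct my_reply_work *rw = kzalloc(sizeof(*rw), GFP_ATOMIC);

        if (!rw)
                return IRQ_HANDLED;

        INIT_WORK(&rw->work, my_reply_worker);
        /* run the completion work on the CPU that issued the IO */
        queue_work_on(submit_cpu, system_wq, &rw->work);
        return IRQ_HANDLED;
}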


Bart.


Re: Observing Softlockup's while running heavy IOs

2016-09-01 Thread Sreekanth Reddy
On Fri, Aug 19, 2016 at 9:26 PM, Bart Van Assche
 wrote:
> On 08/19/2016 04:44 AM, Sreekanth Reddy wrote:
>>
>> [  +0.000439] __blk_mq_run_hw_queue() finished after 10058 ms
>> [ ... ]
>> [  +0.05]  [] ? finish_task_switch+0x6b/0x200
>> [  +0.06]  [] __schedule+0x36c/0x950
>> [  +0.02]  [] schedule+0x37/0x80
>> [  +0.06]  [] futex_wait_queue_me+0xbc/0x120
>> [  +0.04]  [] futex_wait+0x119/0x270
>> [  +0.04]  [] ? futex_wake+0x90/0x180
>> [  +0.03]  [] do_futex+0x12b/0xb00
>> [  +0.05]  [] ? set_next_entity+0x23e/0x440
>> [  +0.07]  [] ? __switch_to+0x261/0x4b0
>> [  +0.04]  [] SyS_futex+0x81/0x180
>> [  +0.02]  [] ? schedule+0x37/0x80
>> [  +0.04]  [] entry_SYSCALL_64_fastpath+0x12/0x71
>
>
> Hello Sreekanth,
>
> If a "soft lockup" is reported that often means that kernel code is
> iterating too long in a loop without giving up the CPU. Inserting a
> cond_resched() call in such loops usually resolves these soft lockup
> complaints. However, your latest e-mail shows that the soft lockup complaint
> was reported on other code than __blk_mq_run_hw_queue(). I'm afraid this
> means that the CPU on which the soft lockup was reported is hammered so hard
> with interrupts that hardly any time remains for the scheduler to run code
> on that CPU. You will have to follow Robert Elliott's advice and reduce the
> time that is spent per CPU in interrupt context.
>

Sorry for the delay in response as I was on vacation.

Bart,

I reduced the ISR workload by one third in order to reduce the time
that is spent per CPU in interrupt context; even then I am observing
softlockups.

As I mentioned before, only the same single CPU in the set of CPUs
(enabled in affinity_hint) is busy with handling the interrupts from
the corresponding IRQx. I have done the below experiment in the driver
to limit these softlockups/hardlockups, but I am not sure whether it is
reasonable to do this in the driver:

Experiment:
If CPUx is continuously busy handling the IO completions of the remote
CPUs (those enabled in the corresponding IRQ's affinity_hint), amounting
to 1/4th of the HBA queue depth within the same ISR context, then enable
a flag called 'change_smp_affinity' for this IRQ. Also, a thread is
created which polls this flag for every IRQ (enabled by the driver)
every second. If this thread sees that the flag is enabled for any IRQ,
it writes the next CPU number from the CPUs enabled in that IRQ's
affinity_hint to the IRQ's smp_affinity procfs attribute using the
'call_usermodehelper()' API.
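
In driver terms, the experiment described above could be sketched roughly as
follows (hypothetical names, shown only to make the call_usermodehelper()
idea concrete; the per-IRQ flag bookkeeping is omitted):

/* Hypothetical sketch of the experiment: a kernel thread that rewrites
 * /proc/irq/<N>/smp_affinity through a usermode helper. */
#include <linux/kernel.h>
#include <linux/kmod.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static int set_irq_smp_affinity(unsigned int irq, unsigned int cpu)
{
        char cmd[128];
        char *argv[] = { "/bin/sh", "-c", cmd, NULL };
        char *envp[] = { "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

        snprintf(cmd, sizeof(cmd),
                 "echo %x > /proc/irq/%u/smp_affinity", 1U << cpu, irq);

        /* UMH_WAIT_PROC: wait for the helper to finish before returning */
        return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

static int affinity_poll_thread(void *data)
{
        while (!kthread_should_stop()) {
                /* for each driver IRQ: if its 'change_smp_affinity' flag is
                 * set, pick the next CPU from affinity_hint and call
                 * set_irq_smp_affinity(irq, next_cpu) */
                ssleep(1);
        }
        return 0;
}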

This is to make sure that interrupts are not processed by the same
single CPU all the time, and to make the other CPUs handle the
interrupts if the current CPU is continuously busy handling the other
CPUs' IO interrupts.

For example, consider a system which has 8 logical CPUs and one MSI-x
vector enabled (called IRQ 120) in the driver, with an HBA queue depth
of 8K. Then the IRQ's procfs attributes will be
IRQ# 120, affinity_hint=0xff, smp_affinity=0x00

After starting heavy IOs, we will observe that only CPU0 is busy with
handling the interrupts. This experimental driver will change the
smp_affinity to the next CPU number, i.e. 0x01 (using the cmd 'echo 0x01 >
/proc/irq/120/smp_affinity'; the driver issues this cmd using the
call_usermodehelper() API), if it observes that CPU0 is continuously
processing more than 2K IO replies belonging to the other CPUs, i.e.
CPU1 to CPU7.

Is it OK to do this kind of thing in the driver?

Thanks,
Sreekanth

> Bart.


RE: Observing Softlockup's while running heavy IOs

2016-08-23 Thread Kashyap Desai
> -Original Message-
> From: Elliott, Robert (Persistent Memory) [mailto:elli...@hpe.com]
> Sent: Saturday, August 20, 2016 2:58 AM
> To: Sreekanth Reddy
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> irqbala...@lists.infradead.org; Kashyap Desai; Sathya Prakash Veerichetty;
> Chaitra Basappa; Suganath Prabu Subramani
> Subject: RE: Observing Softlockup's while running heavy IOs
>
>
>
> > -Original Message-
> > From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> > Sent: Friday, August 19, 2016 6:45 AM
> > To: Elliott, Robert (Persistent Memory) 
> > Subject: Re: Observing Softlockup's while running heavy IOs
> >
> ...
> > Yes I am also observing that all the interrupts are routed to one CPU.
> > But still I observing softlockups (sometime hardlockups) even when I
> > set rq_affinity to 2.

How about the below scenario? For simplicity, assume an HBA with a single
MSI-x vector. (Whenever the HBA supports fewer MSI-x vectors than the system
has logical CPUs, we can see these issues frequently.)

Assume we have 32 logical CPUs (4 sockets, each with 8 logical CPUs), and
CPU-0 is not participating in IO.
The remaining CPUs, 1 to 31, are submitting IO. In such a scenario,
rq_affinity=2 and irqbalance honoring the *exact* smp_affinity_hint will not
help.

We may see a soft/hard lockup on CPU-0. Are we going to resolve such an
issue, or is it too rare to happen in the field?


>
> That'll ensure the block layer's completion handling is done there, but
> not your
> driver's interrupt handler (which precedes the block layer completion
> handling).
>
>
> > Is their any way to route the interrupts the same CPUs which has
> > submitted the corresponding IOs?
> > or
> > Is their any way/option in the irqbalance/kernel which can route
> > interrupts to CPUs (enabled in affinity_hint) in round robin manner
> > after specific time period.
>
> Ensure your driver creates one MSIX interrupt per CPU core, uses that
> interrupt
> for all submissions from that core, and reports that it would like that
> interrupt to
> be serviced by that core in /proc/irq/nnn/affinity_hint.
>
> Even with hyperthreading, this needs to be based on the logical CPU cores,
> not
> just the physical core or the physical socket.
> You can swamp a logical CPU core as easily as a physical CPU core.
>
> Then, provide an irqbalance policy script that honors the affinity_hint
> for your
> driver, or turn off irqbalance and manually set /proc/irq/nnn/smp_affinity
> to
> match the affinity_hint.
>
> Some versions of irqbalance honor the hints; some purposely don't and need
> to
> be overridden with a policy script.
>
>
> ---
> Robert Elliott, HPE Persistent Memory
>


RE: Observing Softlockup's while running heavy IOs

2016-08-19 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> Sent: Friday, August 19, 2016 6:45 AM
> To: Elliott, Robert (Persistent Memory) 
> Subject: Re: Observing Softlockup's while running heavy IOs
> 
...
> Yes I am also observing that all the interrupts are routed to one
> CPU.  But still I observing softlockups (sometime hardlockups)
> even when I set rq_affinity to 2.

That'll ensure the block layer's completion handling is done there,
but not your driver's interrupt handler (which precedes the block
layer completion handling).
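
(For reference, rq_affinity is a per-device block queue attribute; a value of
2 forces the block-layer completion onto the exact CPU that submitted the
request. The device name below is only an example:)

# force block-layer completions back onto the exact submitting CPU
echo 2 > /sys/block/sdb/queue/rq_affinity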

 
> Is their any way to route the interrupts the same CPUs which has
> submitted the corresponding IOs?
> or
> Is their any way/option in the irqbalance/kernel which can route
> interrupts to CPUs (enabled in affinity_hint) in round robin manner
> after specific time period.

Ensure your driver creates one MSIX interrupt per CPU core, uses
that interrupt for all submissions from that core, and reports
that it would like that interrupt to be serviced by that core
in /proc/irq/nnn/affinity_hint.  
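
A hedged sketch of that arrangement (hypothetical driver code, not mpt3sas;
the hint that appears in /proc/irq/nnn/affinity_hint is published with
irq_set_affinity_hint()):

/* Hypothetical sketch: one MSI-X vector per online CPU, with the preferred
 * CPU for each vector published in /proc/irq/<nnn>/affinity_hint. */
#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/slab.h>

static int my_setup_per_cpu_vectors(struct pci_dev *pdev, irq_handler_t handler,
                                    void *drvdata)
{
        struct msix_entry *entries;
        int cpu, nvec, rc;

        nvec = num_online_cpus();
        entries = kcalloc(nvec, sizeof(*entries), GFP_KERNEL);
        if (!entries)
                return -ENOMEM;

        for (cpu = 0; cpu < nvec; cpu++)
                entries[cpu].entry = cpu;

        nvec = pci_enable_msix_range(pdev, entries, 1, nvec);
        if (nvec < 0) {
                kfree(entries);
                return nvec;
        }

        for (cpu = 0; cpu < nvec; cpu++) {
                rc = request_irq(entries[cpu].vector, handler, 0, "my_hba",
                                 drvdata);
                if (rc)
                        return rc;
                /* advertise that this vector should be serviced by 'cpu' */
                irq_set_affinity_hint(entries[cpu].vector, cpumask_of(cpu));
        }

        /* a real driver keeps 'entries' around for later free_irq()/cleanup */
        return 0;
}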

Even with hyperthreading, this needs to be based on the logical
CPU cores, not just the physical core or the physical socket.
You can swamp a logical CPU core as easily as a physical CPU core.

Then, provide an irqbalance policy script that honors the
affinity_hint for your driver, or turn off irqbalance and
manually set /proc/irq/nnn/smp_affinity to match the
affinity_hint.  
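
(If taking the manual route, something along these lines copies each hint
into the effective mask; the IRQ numbers are placeholders for the driver's
vectors:)

# with irqbalance stopped, mirror each driver IRQ's affinity_hint into
# its effective smp_affinity mask
for irq in 219 220; do
    cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
done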

Some versions of irqbalance honor the hints; some purposely
don't and need to be overridden with a policy script.


---
Robert Elliott, HPE Persistent Memory




Re: Observing Softlockup's while running heavy IOs

2016-08-19 Thread Bart Van Assche

On 08/19/2016 04:44 AM, Sreekanth Reddy wrote:

[  +0.000439] __blk_mq_run_hw_queue() finished after 10058 ms
[ ... ]
[  +0.05]  [] ? finish_task_switch+0x6b/0x200
[  +0.06]  [] __schedule+0x36c/0x950
[  +0.02]  [] schedule+0x37/0x80
[  +0.06]  [] futex_wait_queue_me+0xbc/0x120
[  +0.04]  [] futex_wait+0x119/0x270
[  +0.04]  [] ? futex_wake+0x90/0x180
[  +0.03]  [] do_futex+0x12b/0xb00
[  +0.05]  [] ? set_next_entity+0x23e/0x440
[  +0.07]  [] ? __switch_to+0x261/0x4b0
[  +0.04]  [] SyS_futex+0x81/0x180
[  +0.02]  [] ? schedule+0x37/0x80
[  +0.04]  [] entry_SYSCALL_64_fastpath+0x12/0x71


Hello Sreekanth,

If a "soft lockup" is reported that often means that kernel code is 
iterating too long in a loop without giving up the CPU. Inserting a 
cond_resched() call in such loops usually resolves these soft lockup 
complaints. However, your latest e-mail shows that the soft lockup 
complaint was reported on other code than __blk_mq_run_hw_queue(). I'm 
afraid this means that the CPU on which the soft lockup was reported is 
hammered so hard with interrupts that hardly any time remains for the 
scheduler to run code on that CPU. You will have to follow Robert 
Elliott's advice and reduce the time that is spent per CPU in interrupt 
context.
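
As an aside, the cond_resched() pattern referred to above looks roughly like
this (a generic illustration, not a patch against any particular loop):

/* Generic illustration: break up a long-running kernel loop so the
 * scheduler can run and the soft-lockup watchdog is not triggered. */
#include <linux/list.h>
#include <linux/sched.h>

static void process_many_items(struct list_head *items)
{
        struct list_head *pos;

        list_for_each(pos, items) {
                /* ... per-item work ... */

                /* voluntarily yield the CPU if a reschedule is due */
                cond_resched();
        }
}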


Bart.


Re: Observing Softlockup's while running heavy IOs

2016-08-19 Thread Sreekanth Reddy
.73  0.60  29.31  14.90   0.00   0.00   1.57   0.00   0.00  47.89
02:22:44 PM   0   8.82   0.00  48.53  38.24   0.00   0.00   0.00   0.00   0.00   4.41
02:22:44 PM   1  10.29   0.00  63.24  26.47   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM   2   8.96   0.00  65.67  25.37   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM   3  11.94   0.00  61.19  26.87   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM   4   8.96   0.00  64.18  26.87   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM   5   7.46   0.00  67.16  25.37   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM   6   0.00   0.00   0.00   0.00   0.00   0.00 100.00   0.00   0.00   0.00
02:22:44 PM   7   1.33   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  98.67
02:22:44 PM   8   5.19   0.00   1.30   0.00   0.00   0.00   0.00   0.00   0.00  93.51
02:22:44 PM   9   2.67   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  97.33
02:22:44 PM  10   2.67   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  97.33
02:22:44 PM  11   3.95   0.00   1.32   1.32   0.00   0.00   0.00   0.00   0.00  93.42
02:22:44 PM  12  11.11   0.00  83.33   5.56   0.00   0.00   0.00   0.00   0.00   0.00
02:22:44 PM  13   8.70   0.00  52.17  36.23   0.00   0.00   0.00   0.00   0.00   2.90
02:22:44 PM  14   6.15   0.00  53.85  38.46   0.00   0.00   0.00   0.00   0.00   1.54
02:22:44 PM  15   5.97   0.00  55.22  37.31   0.00   0.00   0.00   0.00   0.00   1.49
02:22:44 PM  16   8.70   0.00  52.17  36.23   0.00   0.00   0.00   0.00   0.00   2.90
02:22:44 PM  17   7.35   0.00  55.88  35.29   0.00   0.00   0.00   0.00   0.00   1.47
02:22:44 PM  18   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
02:22:44 PM  19   2.63   0.00   1.32   2.63   0.00   0.00   0.00   0.00   0.00  93.42
02:22:44 PM  20   2.67   0.00   1.33   0.00   0.00   0.00   0.00   0.00   0.00  96.00
02:22:44 PM  21   7.89   5.26   2.63   0.00   0.00   0.00   0.00   0.00   0.00  84.21
02:22:44 PM  22   1.32   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  98.68
02:22:44 PM  23   4.05   8.11   2.70   4.05   0.00   0.00   0.00   0.00   0.00  81.08

Still I am continuing my investigation on this.

Note:
I am taking vacation next week, please expect some delay for response.

Thanks,
Sreekanth

On Fri, Aug 19, 2016 at 2:38 AM, Elliott, Robert (Persistent Memory)
 wrote:
>
>
>> -Original Message-
>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> ow...@vger.kernel.org] On Behalf Of Sreekanth Reddy
>> Sent: Thursday, August 18, 2016 12:56 AM
>> Subject: Observing Softlockup's while running heavy IOs
>>
>> Problem statement:
>> Observing softlockups while running heavy IOs on 8 SSD drives
>> connected behind our LSI SAS 3004 HBA.
>>
> ...
>> Observing a loop in the IO path, i.e only one CPU is busy with
>> processing the interrupts and other CPUs (in the affinity_hint mask)
>> are busy with sending the IOs (these CPUs are not yet all receiving
>> any interrupts). For example, only CPU6 is busy with processing the
>> interrupts from IRQ 219 and remaining CPUs i.e CPU 7,8,9,10 & 11 are
>> just busy with pumping the IOs and they never processed any IO
>> interrupts from IRQ 219. So we are observing softlockups due to
>> existence this loop in the IO Path.
>>
>> We may not observe these softlockups if irqbalancer might have
>> balanced the interrupts among the CPUs enabled in the particular
>> irq's
>> affinity_hint mask. so that all the CPUs are equaly busy with send
>> IOs
>> and processing the interrupts. I am not sure how irqbalancer balance
>> the load among the CPUs, but here I see only one CPU from irq's
>> affinity_hint mask is busy with interrupts and remaining CPUs won't
>> receive any interrupts from this IRQ.
>>
>> Please help me with any suggestions/recomendations to slove/limit
>> these kind of softlockups. Also please let me known if I have missed
>> any setting in the irqbalance.
>>
>

Re: Observing Softlockup's while running heavy IOs

2016-08-18 Thread Bart Van Assche
On 08/17/16 22:55, Sreekanth Reddy wrote:
> Observing softlockups while running heavy IOs on 8 SSD drives
> connected behind our LSI SAS 3004 HBA.

Hello Sreekanth,

This means that more than 23s was spent before the scheduler was 
invoked, probably due to a loop. Can you give the attached (untested) 
patch a try to see whether it is the loop in __blk_mq_run_hw_queue()?

Thanks,

Bart.

From 4da94f2ec37ee5d1b4a5f1ce2886bdafd5cd394c Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Thu, 18 Aug 2016 07:51:49 -0700
Subject: [PATCH] block: Measure __blk_mq_run_hw_queue() execution time

Note: the "max_elapsed" variable can be modified by multiple threads
concurrently.
---
 block/blk-mq.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e931a0e..6d0961c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -792,6 +792,9 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
 	int queued;
+	static long max_elapsed = -1;
+	unsigned long start = jiffies;
+	long elapsed;
 
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
 
@@ -889,6 +892,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
 	}
+
+	elapsed = jiffies - start;
+	if (elapsed > max_elapsed) {
+		max_elapsed = elapsed;
+		pr_info("%s() finished after %d ms\n", __func__,
+			jiffies_to_msecs(elapsed));
+	}
 }
 
 /*
-- 
2.9.2



RE: Observing Softlockup's while running heavy IOs

2016-08-18 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Sreekanth Reddy
> Sent: Thursday, August 18, 2016 12:56 AM
> Subject: Observing Softlockup's while running heavy IOs
> 
> Problem statement:
> Observing softlockups while running heavy IOs on 8 SSD drives
> connected behind our LSI SAS 3004 HBA.
> 
...
> Observing a loop in the IO path, i.e only one CPU is busy with
> processing the interrupts and other CPUs (in the affinity_hint mask)
> are busy with sending the IOs (these CPUs are not yet all receiving
> any interrupts). For example, only CPU6 is busy with processing the
> interrupts from IRQ 219 and remaining CPUs i.e CPU 7,8,9,10 & 11 are
> just busy with pumping the IOs and they never processed any IO
> interrupts from IRQ 219. So we are observing softlockups due to
> existence this loop in the IO Path.
> 
> We may not observe these softlockups if irqbalancer might have
> balanced the interrupts among the CPUs enabled in the particular
> irq's
> affinity_hint mask. so that all the CPUs are equaly busy with send
> IOs
> and processing the interrupts. I am not sure how irqbalancer balance
> the load among the CPUs, but here I see only one CPU from irq's
> affinity_hint mask is busy with interrupts and remaining CPUs won't
> receive any interrupts from this IRQ.
> 
> Please help me with any suggestions/recomendations to slove/limit
> these kind of softlockups. Also please let me known if I have missed
> any setting in the irqbalance.
> 

The CPUs need to be forced to self-throttle by processing interrupts for 
their own submissions, which reduces the time they can submit more IOs.

See https://lkml.org/lkml/2014/9/9/931 for discussion of this
problem when blk-mq was added.


---
Robert Elliott, HPE Persistent Memory





Observing Softlockup's while running heavy IOs

2016-08-17 Thread Sreekanth Reddy
Hi,

Problem statement:
Observing softlockups while running heavy IOs on 8 SSD drives
connected behind our LSI SAS 3004 HBA.

System configuration:
OS & kernel version: Fedora 23, v4.2.3-300.fc23.x86_64
NUMA : disabled,
CPUs : 24 logical cpus,
SCSI_MQ:  enabled
Driver : mpt3sas,
MSIx vectors: we have enabled only 2 MSIx vectors using driver module
parameter 'max_msix_vectors' set to 2,
IRQbalance version: v1.0.9
IRQbalance policy : 'subset'
rq_affinity : 2

mpt3sas IRQs info:
irq number =  219, smp_affinity = 40 affinity_hint = 000fff
irq number =  220, smp_affinity = 001000 affinity_hint = fff000
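
(These values can be re-checked at any time with something like the
following, using the IRQ numbers listed above:)

for irq in 219 220; do
    echo "irq $irq: smp_affinity=$(cat /proc/irq/$irq/smp_affinity)" \
         "affinity_hint=$(cat /proc/irq/$irq/affinity_hint)"
done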

Issued IOs using the fio tool with the below parameters:
iodepth=128
direct=1
runtime=300
group_reporting
ioengine=libaio
cpus_allowed=6,7,8,9,10,11,18,19,20,21,22,23
time_based
[Ran-Read-4k-IOPs]
numjobs=24
rw=randread
bs=4k


My Analysis:
I am observing a loop in the IO path, i.e. only one CPU is busy with
processing the interrupts while the other CPUs (in the affinity_hint
mask) are busy with sending the IOs (these CPUs are not receiving any
interrupts at all). For example, only CPU6 is busy with processing the
interrupts from IRQ 219, and the remaining CPUs, i.e. CPU 7, 8, 9, 10 & 11,
are just busy with pumping the IOs and never process any IO interrupts
from IRQ 219. So we are observing softlockups due to the existence of
this loop in the IO path.

We might not observe these softlockups if the irqbalancer had balanced
the interrupts among the CPUs enabled in the particular IRQ's
affinity_hint mask, so that all the CPUs were equally busy with sending
IOs and processing the interrupts. I am not sure how the irqbalancer
balances the load among the CPUs, but here I see that only one CPU from
the IRQ's affinity_hint mask is busy with interrupts and the remaining
CPUs never receive any interrupts from this IRQ.

Please help me with any suggestions/recommendations to solve/limit
these kinds of softlockups. Also, please let me know if I have missed
any setting in irqbalance.

Here is my 'sar -u ALL -P ALL 1' command output when the softlockup occurred:

12:16:11 PM     CPU   %usr  %nice   %sys  %iowait  %steal   %irq   %soft  %guest  %gnice  %idle
12:16:12 PM     all   6.31   0.00  31.62   4.49    0.00     0.00    2.59   0.00    0.00   54.99
12:16:12 PM       0   4.08   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   95.92
12:16:12 PM       1   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM       2   2.97   0.00   1.98   0.00    0.00     0.00    0.00   0.00    0.00   95.05
12:16:12 PM       3   7.00   0.00   1.00   0.00    0.00     0.00    0.00   0.00    0.00   92.00
12:16:12 PM       4   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM       5   4.04   0.00   1.01   0.00    0.00     0.00    0.00   0.00    0.00   94.95
12:16:12 PM       6   0.00   0.00   0.00   0.00    0.00     0.00  100.00   0.00    0.00    0.00
12:16:12 PM       7  17.02   0.00  58.51   0.00    0.00     0.00    0.00   0.00    0.00   24.47
12:16:12 PM       8  14.13   0.00  61.96  23.91    0.00     0.00    0.00   0.00    0.00    0.00
12:16:12 PM       9   5.21   0.00  14.58  80.21    0.00     0.00    0.00   0.00    0.00    0.00
12:16:12 PM      10   4.21   0.00  14.74   0.00    0.00     0.00    0.00   0.00    0.00   81.05
12:16:12 PM      11  13.04   0.00  61.96   0.00    0.00     0.00    0.00   0.00    0.00   25.00
12:16:12 PM      12   0.00   0.00   0.00   0.00    0.00     0.00  100.00   0.00    0.00    0.00
12:16:12 PM      13   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM      14   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM      15   3.03   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.97
12:16:12 PM      16   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM      17   4.00   0.00   0.00   0.00    0.00     0.00    0.00   0.00    0.00   96.00
12:16:12 PM      18   0.00   0.00 100.00   0.00    0.00     0.00    0.00   0.00    0.00    0.00
12:16:12 PM      19  16.84   0.00  69.47   0.00    0.00     0.00    0.00   0.00    0.00   13.68
12:16:12 PM      20  16.13   0.00  69.89   0.00    0.00     0.00    0.00   0.00    0.00   13.98
12:16:12 PM      21   0.00   0.00 100.00   0.00    0.00     0.00    0.00   0.00    0.00    0.00
12:16:12 PM      22   0.00   0.00 100.00   0.00    0.00     0.00    0.00   0.00    0.