Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-29 Thread Dmitry Safonov
2018-03-29 9:50 GMT+01:00 Joerg Roedel :
> On Tue, Mar 20, 2018 at 08:50:13PM +, Dmitry Safonov wrote:
>> Hmm, but this fixes my softlockup issue, because it's about time spent
>> in printk() inside irq-disabled section, rather about exiting the dmar-
>> clearing loop.
>> And on my hw doesn't make any difference to limit loop or not because
>> clearing a fault is much faster than hw could generate a new fault.
>> ITOW, it fixes the softlockup for me and the loop-related lockup can't
>> happen on hw I have (so it's the other issue, [possible?] on other hw).
>
> It might solve your issue, but someone else might still run into it with
> a different setup. An upstream fix needs to solve it for everyone.

Ok, I'll resend v4 with an additional patch to limit the loop.

Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-29 Thread Joerg Roedel
On Tue, Mar 20, 2018 at 08:50:13PM +, Dmitry Safonov wrote:
> Hmm, but this fixes my softlockup issue, because it's about time spent
> in printk() inside irq-disabled section, rather about exiting the dmar-
> clearing loop.
> And on my hw doesn't make any difference to limit loop or not because
> clearing a fault is much faster than hw could generate a new fault.
> ITOW, it fixes the softlockup for me and the loop-related lockup can't
> happen on hw I have (so it's the other issue, [possible?] on other hw).

It might solve your issue, but someone else might still run into it with
a different setup. An upstream fix needs to solve it for everyone.

Thanks,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-20 Thread Dmitry Safonov via iommu
On Thu, 2018-03-15 at 16:28 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:42:00PM +, Dmitry Safonov wrote:
> > But even with loop-limit we will need ratelimit each printk()
> > *also*.
> > Otherwise loop-limit will be based on time spent printing, not on
> > anything else..
> > The patch makes sense even with loop-limit in my opinion.
> 
> Looks like I mis-read your patch, somehow it looked to me as if you
> replace all 'ratelimited' usages with a call to __ratelimit(), but
> you
> just move 'ratelimited' into the loop, which actually makes sense.

So, is it worth to apply the patch?

> But still, this alone is no proper fix for the soft-lockups you are
> seeing.

Hmm, but this fixes my softlockup issue, because it's about time spent
in printk() inside irq-disabled section, rather about exiting the dmar-
clearing loop.
And on my hw doesn't make any difference to limit loop or not because
clearing a fault is much faster than hw could generate a new fault.
ITOW, it fixes the softlockup for me and the loop-related lockup can't
happen on hw I have (so it's the other issue, [possible?] on other hw).

-- 
Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Dmitry Safonov via iommu
On Thu, 2018-03-15 at 16:28 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:42:00PM +, Dmitry Safonov wrote:
> > But even with loop-limit we will need ratelimit each printk()
> > *also*.
> > Otherwise loop-limit will be based on time spent printing, not on
> > anything else..
> > The patch makes sense even with loop-limit in my opinion.
> 
> Looks like I mis-read your patch, somehow it looked to me as if you
> replace all 'ratelimited' usages with a call to __ratelimit(), but
> you
> just move 'ratelimited' into the loop, which actually makes sense.

Oh, ok

> But still, this alone is no proper fix for the soft-lockups you are
> seeing.

Well, I can also limit number of loops with say cap_num_fault_regs().
I didn't do that as on my measures the time spent on clearing a fault
is so small, that I'm not sure if it's possible to stuck in this loop.

-- 
Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Joerg Roedel
On Thu, Mar 15, 2018 at 02:42:00PM +, Dmitry Safonov wrote:
> But even with loop-limit we will need ratelimit each printk() *also*.
> Otherwise loop-limit will be based on time spent printing, not on
> anything else..
> The patch makes sense even with loop-limit in my opinion.

Looks like I mis-read your patch, somehow it looked to me as if you
replace all 'ratelimited' usages with a call to __ratelimit(), but you
just move 'ratelimited' into the loop, which actually makes sense.

But still, this alone is no proper fix for the soft-lockups you are
seeing.


Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Dmitry Safonov via iommu
On Thu, 2018-03-15 at 14:34 +, Dmitry Safonov wrote:
> On Thu, 2018-03-15 at 15:22 +0100, Joerg Roedel wrote:
> > On Thu, Mar 15, 2018 at 02:13:03PM +, Dmitry Safonov wrote:
> > > So, you suggest to remove ratelimit at all?
> > > Do we really need printk flood for each happened fault?
> > > Imagine, you've hundreds of mappings and then PCI link flapped..
> > > Wouldn't it be better to keep ratelimiting?
> > > I don't mind, just it looks a bit strange to me.
> > 
> > I never said you should remove the ratelimiting, after all you are
> > trying to fix a soft-lockup, no?
> > 
> > And that should not be fixed by changes to the ratelimiting, but
> > with
> > proper irq handling.
> 
> Uh, I'm a bit confused then.
> - Isn't it better to ratelimit each printk() instead of bunch of
> printks inside irq handler?
> - I can limit the number of loops, but the most of the time is spent
> in
> the loop on printk() (on my machine ~170msec per loop), while
> everything else takes much lesser time (on my machine < 1 usec per
> loop). So, if I will limit number of loops per-irq, that cycle-limit
> will be based on limiting time spent on printk (e.g., how many
> printks
> to do in atomic context so that node will not lockup). It smells like
> ratelimiting, no?
> 
> I must be misunderstanding something, but why introducing another
> limit
> for number of printk() called when there is ratelimit which may be
> tuned..
> 

So I agree, that maybe better to have another limit to the cycle
*also*, because if we clean faults with the same speed as they're
generated by hw, we may stuck in the loop..
By on my measures clearing fault is so fast (< 1 usec), that I'm not
sure that it may happen with hw. By that reason I didn't introduce
loop-limit.

But even with loop-limit we will need ratelimit each printk() *also*.
Otherwise loop-limit will be based on time spent printing, not on
anything else..
The patch makes sense even with loop-limit in my opinion.

-- 
Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Dmitry Safonov via iommu
On Thu, 2018-03-15 at 15:22 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:13:03PM +, Dmitry Safonov wrote:
> > So, you suggest to remove ratelimit at all?
> > Do we really need printk flood for each happened fault?
> > Imagine, you've hundreds of mappings and then PCI link flapped..
> > Wouldn't it be better to keep ratelimiting?
> > I don't mind, just it looks a bit strange to me.
> 
> I never said you should remove the ratelimiting, after all you are
> trying to fix a soft-lockup, no?
> 
> And that should not be fixed by changes to the ratelimiting, but with
> proper irq handling.

Uh, I'm a bit confused then.
- Isn't it better to ratelimit each printk() instead of bunch of
printks inside irq handler?
- I can limit the number of loops, but the most of the time is spent in
the loop on printk() (on my machine ~170msec per loop), while
everything else takes much lesser time (on my machine < 1 usec per
loop). So, if I will limit number of loops per-irq, that cycle-limit
will be based on limiting time spent on printk (e.g., how many printks
to do in atomic context so that node will not lockup). It smells like
ratelimiting, no?

I must be misunderstanding something, but why introducing another limit
for number of printk() called when there is ratelimit which may be
tuned..

-- 
Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Joerg Roedel
On Thu, Mar 15, 2018 at 02:13:03PM +, Dmitry Safonov wrote:
> So, you suggest to remove ratelimit at all?
> Do we really need printk flood for each happened fault?
> Imagine, you've hundreds of mappings and then PCI link flapped..
> Wouldn't it be better to keep ratelimiting?
> I don't mind, just it looks a bit strange to me.

I never said you should remove the ratelimiting, after all you are
trying to fix a soft-lockup, no?

And that should not be fixed by changes to the ratelimiting, but with
proper irq handling.


Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Dmitry Safonov via iommu
On Thu, 2018-03-15 at 14:46 +0100, Joerg Roedel wrote:
> On Thu, Feb 15, 2018 at 07:17:29PM +, Dmitry Safonov wrote:
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index accf58388bdb..6c4ea32ee6a9 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void
> > *dev_id)
> > int reg, fault_index;
> > u32 fault_status;
> > unsigned long flag;
> > -   bool ratelimited;
> > static DEFINE_RATELIMIT_STATE(rs,
> >   DEFAULT_RATELIMIT_INTERVAL,
> >   DEFAULT_RATELIMIT_BURST);
> >  
> > -   /* Disable printing, simply clear the fault when
> > ratelimited */
> > -   ratelimited = !__ratelimit(&rs);
> > -
> > raw_spin_lock_irqsave(&iommu->register_lock, flag);
> > fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> > -   if (fault_status && !ratelimited)
> > +   if (fault_status && __ratelimit(&rs))
> > pr_err("DRHD: handling fault status reg %x\n",
> > fault_status);
> 
> This looks aweful. Have you tried to limit the number of loops in
> this
> function and returning? You can handle the next faults by the next
> interrupt. This ensures that the cpu visits a scheduling point from
> time
> to time so that you don't see soft-lockups.

So, you suggest to remove ratelimit at all?
Do we really need printk flood for each happened fault?
Imagine, you've hundreds of mappings and then PCI link flapped..
Wouldn't it be better to keep ratelimiting?
I don't mind, just it looks a bit strange to me.

-- 
Thanks,
 Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-15 Thread Joerg Roedel
On Thu, Feb 15, 2018 at 07:17:29PM +, Dmitry Safonov wrote:
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index accf58388bdb..6c4ea32ee6a9 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>   int reg, fault_index;
>   u32 fault_status;
>   unsigned long flag;
> - bool ratelimited;
>   static DEFINE_RATELIMIT_STATE(rs,
> DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
>  
> - /* Disable printing, simply clear the fault when ratelimited */
> - ratelimited = !__ratelimit(&rs);
> -
>   raw_spin_lock_irqsave(&iommu->register_lock, flag);
>   fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> - if (fault_status && !ratelimited)
> + if (fault_status && __ratelimit(&rs))
>   pr_err("DRHD: handling fault status reg %x\n", fault_status);

This looks aweful. Have you tried to limit the number of loops in this
function and returning? You can handle the next faults by the next
interrupt. This ensures that the cpu visits a scheduling point from time
to time so that you don't see soft-lockups.



Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-13 Thread Dmitry Safonov via iommu
Gentle ping?

On Mon, 2018-03-05 at 15:00 +, Dmitry Safonov wrote:
> Hi Joerg,
> 
> What do you think about v3?
> It looks like, I can solve my softlookups with just a bit more proper
> ratelimiting..
> 
> On Thu, 2018-02-15 at 19:17 +, Dmitry Safonov wrote:
> > There is a ratelimit for printing, but it's incremented each time
> > the
> > cpu recives dmar fault interrupt. While one interrupt may signal
> > about
> > *many* faults.
> > So, measuring the impact it turns out that reading/clearing one
> > fault
> > takes < 1 usec, and printing info about the fault takes ~170 msec.
> > 
> > Having in mind that maximum number of fault recording registers per
> > remapping hardware unit is 256.. IRQ handler may run for (170*256)
> > msec.
> > And as fault-serving loop runs without a time limit, during
> > servicing
> > new faults may occur..
> > 
> > Ratelimit each fault printing rather than each irq printing.
> > 
> > Fixes: commit c43fce4eebae ("iommu/vt-d: Ratelimit fault handler")
> > 
> > BUG: spinlock lockup suspected on CPU#0, CliShell/9903
> >  lock: 0x81a47440, .magic: dead4ead, .owner:
> > kworker/u16:2/8915, .owner_cpu: 6
> > CPU: 0 PID: 9903 Comm: CliShell
> > Call Trace:$\n'
> > [..] dump_stack+0x65/0x83$\n'
> > [..] spin_dump+0x8f/0x94$\n'
> > [..] do_raw_spin_lock+0x123/0x170$\n'
> > [..] _raw_spin_lock_irqsave+0x32/0x3a$\n'
> > [..] uart_chars_in_buffer+0x20/0x4d$\n'
> > [..] tty_chars_in_buffer+0x18/0x1d$\n'
> > [..] n_tty_poll+0x1cb/0x1f2$\n'
> > [..] tty_poll+0x5e/0x76$\n'
> > [..] do_select+0x363/0x629$\n'
> > [..] compat_core_sys_select+0x19e/0x239$\n'
> > [..] compat_SyS_select+0x98/0xc0$\n'
> > [..] sysenter_dispatch+0x7/0x25$\n'
> > [..]
> > NMI backtrace for cpu 6
> > CPU: 6 PID: 8915 Comm: kworker/u16:2
> > Workqueue: dmar_fault dmar_fault_work
> > Call Trace:$\n'
> > [..] wait_for_xmitr+0x26/0x8f$\n'
> > [..] serial8250_console_putchar+0x1c/0x2c$\n'
> > [..] uart_console_write+0x40/0x4b$\n'
> > [..] serial8250_console_write+0xe6/0x13f$\n'
> > [..] call_console_drivers.constprop.13+0xce/0x103$\n'
> > [..] console_unlock+0x1f8/0x39b$\n'
> > [..] vprintk_emit+0x39e/0x3e6$\n'
> > [..] printk+0x4d/0x4f$\n'
> > [..] dmar_fault+0x1a8/0x1fc$\n'
> > [..] dmar_fault_work+0x15/0x17$\n'
> > [..] process_one_work+0x1e8/0x3a9$\n'
> > [..] worker_thread+0x25d/0x345$\n'
> > [..] kthread+0xea/0xf2$\n'
> > [..] ret_from_fork+0x58/0x90$\n'
> > 
> > Cc: Alex Williamson 
> > Cc: David Woodhouse 
> > Cc: Ingo Molnar 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Cc: iommu@lists.linux-foundation.org
> > Signed-off-by: Dmitry Safonov 
> > ---
> > Maybe it's worth to limit while(1) cycle.
> > If IOMMU generates faults with equal speed as irq handler cleans
> > them, it may turn into long-irq-disabled region again.
> > Not sure if it can happen anyway.
> > 
> >  drivers/iommu/dmar.c | 8 +++-
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index accf58388bdb..6c4ea32ee6a9 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void
> > *dev_id)
> > int reg, fault_index;
> > u32 fault_status;
> > unsigned long flag;
> > -   bool ratelimited;
> > static DEFINE_RATELIMIT_STATE(rs,
> >   DEFAULT_RATELIMIT_INTERVAL,
> >   DEFAULT_RATELIMIT_BURST);
> >  
> > -   /* Disable printing, simply clear the fault when
> > ratelimited
> > */
> > -   ratelimited = !__ratelimit(&rs);
> > -
> > raw_spin_lock_irqsave(&iommu->register_lock, flag);
> > fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> > -   if (fault_status && !ratelimited)
> > +   if (fault_status && __ratelimit(&rs))
> > pr_err("DRHD: handling fault status reg %x\n",
> > fault_status);
> >  
> > /* TBD: ignore advanced fault log currently */
> > @@ -1638,6 +1634,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
> > fault_index = dma_fsts_fault_record_index(fault_status);
> > reg = cap_fault_reg_offset(iommu->cap);
> > while (1) {
> > +   /* Disable printing, simply clear the fault when
> > ratelimited */
> > +   bool ratelimited = !__ratelimit(&rs);
> > u8 fault_reason;
> > u16 source_id;
> > u64 guest_addr;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-03-05 Thread Dmitry Safonov via iommu
Hi Joerg,

What do you think about v3?
It looks like, I can solve my softlookups with just a bit more proper
ratelimiting..

On Thu, 2018-02-15 at 19:17 +, Dmitry Safonov wrote:
> There is a ratelimit for printing, but it's incremented each time the
> cpu recives dmar fault interrupt. While one interrupt may signal
> about
> *many* faults.
> So, measuring the impact it turns out that reading/clearing one fault
> takes < 1 usec, and printing info about the fault takes ~170 msec.
> 
> Having in mind that maximum number of fault recording registers per
> remapping hardware unit is 256.. IRQ handler may run for (170*256)
> msec.
> And as fault-serving loop runs without a time limit, during servicing
> new faults may occur..
> 
> Ratelimit each fault printing rather than each irq printing.
> 
> Fixes: commit c43fce4eebae ("iommu/vt-d: Ratelimit fault handler")
> 
> BUG: spinlock lockup suspected on CPU#0, CliShell/9903
>  lock: 0x81a47440, .magic: dead4ead, .owner:
> kworker/u16:2/8915, .owner_cpu: 6
> CPU: 0 PID: 9903 Comm: CliShell
> Call Trace:$\n'
> [..] dump_stack+0x65/0x83$\n'
> [..] spin_dump+0x8f/0x94$\n'
> [..] do_raw_spin_lock+0x123/0x170$\n'
> [..] _raw_spin_lock_irqsave+0x32/0x3a$\n'
> [..] uart_chars_in_buffer+0x20/0x4d$\n'
> [..] tty_chars_in_buffer+0x18/0x1d$\n'
> [..] n_tty_poll+0x1cb/0x1f2$\n'
> [..] tty_poll+0x5e/0x76$\n'
> [..] do_select+0x363/0x629$\n'
> [..] compat_core_sys_select+0x19e/0x239$\n'
> [..] compat_SyS_select+0x98/0xc0$\n'
> [..] sysenter_dispatch+0x7/0x25$\n'
> [..]
> NMI backtrace for cpu 6
> CPU: 6 PID: 8915 Comm: kworker/u16:2
> Workqueue: dmar_fault dmar_fault_work
> Call Trace:$\n'
> [..] wait_for_xmitr+0x26/0x8f$\n'
> [..] serial8250_console_putchar+0x1c/0x2c$\n'
> [..] uart_console_write+0x40/0x4b$\n'
> [..] serial8250_console_write+0xe6/0x13f$\n'
> [..] call_console_drivers.constprop.13+0xce/0x103$\n'
> [..] console_unlock+0x1f8/0x39b$\n'
> [..] vprintk_emit+0x39e/0x3e6$\n'
> [..] printk+0x4d/0x4f$\n'
> [..] dmar_fault+0x1a8/0x1fc$\n'
> [..] dmar_fault_work+0x15/0x17$\n'
> [..] process_one_work+0x1e8/0x3a9$\n'
> [..] worker_thread+0x25d/0x345$\n'
> [..] kthread+0xea/0xf2$\n'
> [..] ret_from_fork+0x58/0x90$\n'
> 
> Cc: Alex Williamson 
> Cc: David Woodhouse 
> Cc: Ingo Molnar 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Cc: iommu@lists.linux-foundation.org
> Signed-off-by: Dmitry Safonov 
> ---
> Maybe it's worth to limit while(1) cycle.
> If IOMMU generates faults with equal speed as irq handler cleans
> them, it may turn into long-irq-disabled region again.
> Not sure if it can happen anyway.
> 
>  drivers/iommu/dmar.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index accf58388bdb..6c4ea32ee6a9 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>   int reg, fault_index;
>   u32 fault_status;
>   unsigned long flag;
> - bool ratelimited;
>   static DEFINE_RATELIMIT_STATE(rs,
> DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
>  
> - /* Disable printing, simply clear the fault when ratelimited
> */
> - ratelimited = !__ratelimit(&rs);
> -
>   raw_spin_lock_irqsave(&iommu->register_lock, flag);
>   fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> - if (fault_status && !ratelimited)
> + if (fault_status && __ratelimit(&rs))
>   pr_err("DRHD: handling fault status reg %x\n",
> fault_status);
>  
>   /* TBD: ignore advanced fault log currently */
> @@ -1638,6 +1634,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>   fault_index = dma_fsts_fault_record_index(fault_status);
>   reg = cap_fault_reg_offset(iommu->cap);
>   while (1) {
> + /* Disable printing, simply clear the fault when
> ratelimited */
> + bool ratelimited = !__ratelimit(&rs);
>   u8 fault_reason;
>   u16 source_id;
>   u64 guest_addr;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCHv3] iommu/intel: Ratelimit each dmar fault printing

2018-02-15 Thread Dmitry Safonov via iommu
There is a ratelimit for printing, but it's incremented each time the
cpu recives dmar fault interrupt. While one interrupt may signal about
*many* faults.
So, measuring the impact it turns out that reading/clearing one fault
takes < 1 usec, and printing info about the fault takes ~170 msec.

Having in mind that maximum number of fault recording registers per
remapping hardware unit is 256.. IRQ handler may run for (170*256) msec.
And as fault-serving loop runs without a time limit, during servicing
new faults may occur..

Ratelimit each fault printing rather than each irq printing.

Fixes: commit c43fce4eebae ("iommu/vt-d: Ratelimit fault handler")

BUG: spinlock lockup suspected on CPU#0, CliShell/9903
 lock: 0x81a47440, .magic: dead4ead, .owner: kworker/u16:2/8915, 
.owner_cpu: 6
CPU: 0 PID: 9903 Comm: CliShell
Call Trace:$\n'
[..] dump_stack+0x65/0x83$\n'
[..] spin_dump+0x8f/0x94$\n'
[..] do_raw_spin_lock+0x123/0x170$\n'
[..] _raw_spin_lock_irqsave+0x32/0x3a$\n'
[..] uart_chars_in_buffer+0x20/0x4d$\n'
[..] tty_chars_in_buffer+0x18/0x1d$\n'
[..] n_tty_poll+0x1cb/0x1f2$\n'
[..] tty_poll+0x5e/0x76$\n'
[..] do_select+0x363/0x629$\n'
[..] compat_core_sys_select+0x19e/0x239$\n'
[..] compat_SyS_select+0x98/0xc0$\n'
[..] sysenter_dispatch+0x7/0x25$\n'
[..]
NMI backtrace for cpu 6
CPU: 6 PID: 8915 Comm: kworker/u16:2
Workqueue: dmar_fault dmar_fault_work
Call Trace:$\n'
[..] wait_for_xmitr+0x26/0x8f$\n'
[..] serial8250_console_putchar+0x1c/0x2c$\n'
[..] uart_console_write+0x40/0x4b$\n'
[..] serial8250_console_write+0xe6/0x13f$\n'
[..] call_console_drivers.constprop.13+0xce/0x103$\n'
[..] console_unlock+0x1f8/0x39b$\n'
[..] vprintk_emit+0x39e/0x3e6$\n'
[..] printk+0x4d/0x4f$\n'
[..] dmar_fault+0x1a8/0x1fc$\n'
[..] dmar_fault_work+0x15/0x17$\n'
[..] process_one_work+0x1e8/0x3a9$\n'
[..] worker_thread+0x25d/0x345$\n'
[..] kthread+0xea/0xf2$\n'
[..] ret_from_fork+0x58/0x90$\n'

Cc: Alex Williamson 
Cc: David Woodhouse 
Cc: Ingo Molnar 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Dmitry Safonov 
---
Maybe it's worth to limit while(1) cycle.
If IOMMU generates faults with equal speed as irq handler cleans
them, it may turn into long-irq-disabled region again.
Not sure if it can happen anyway.

 drivers/iommu/dmar.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index accf58388bdb..6c4ea32ee6a9 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
int reg, fault_index;
u32 fault_status;
unsigned long flag;
-   bool ratelimited;
static DEFINE_RATELIMIT_STATE(rs,
  DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
-   /* Disable printing, simply clear the fault when ratelimited */
-   ratelimited = !__ratelimit(&rs);
-
raw_spin_lock_irqsave(&iommu->register_lock, flag);
fault_status = readl(iommu->reg + DMAR_FSTS_REG);
-   if (fault_status && !ratelimited)
+   if (fault_status && __ratelimit(&rs))
pr_err("DRHD: handling fault status reg %x\n", fault_status);
 
/* TBD: ignore advanced fault log currently */
@@ -1638,6 +1634,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
fault_index = dma_fsts_fault_record_index(fault_status);
reg = cap_fault_reg_offset(iommu->cap);
while (1) {
+   /* Disable printing, simply clear the fault when ratelimited */
+   bool ratelimited = !__ratelimit(&rs);
u8 fault_reason;
u16 source_id;
u64 guest_addr;
-- 
2.13.6

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu