On Thursday 12/11 at 15:43 +0100, Borislav Petkov wrote:
> On Wed, Dec 10, 2014 at 06:34:38PM -0800, Calvin Owens wrote:
> > This doens't fix the original issue though: the timer double-add can
> > still happen if the CMCI interrupt that hits the STORM threshold pops
> > off during mce_timer_fn() and calls mce_timer_kick().
> 
> Right, I guess we could do something similar to what you proposed
> already (diff below ontop of my previous diff).

That looks good, thanks :)
 
> > The polling is unnecessary on a CMCI-capable machine except in STORMs,
> > right? Wouldn't it be better to eliminate it in that case?
> 
> Right, the problem with error thresholding is that it only counts errors
> and raises the APIC interrupt when a certain count has been reached.
> But there's no way for the software to look at *each* error and try to
> discover trends and stuff.
> 
> Thus, we've been working on something which looks at each memory error,
> collects them and then decays them over time and acts only for the
> significant ones by poisoning pages:
> 
> https://urldefense.proofpoint.com/v1/url?u=https://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp%40alien8.de&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=oEb120Cp%2FehdhuY2M7qjelK5yT8IPB5WC2nEG4obDh4%3D%0A&m=Uvvje79YqtSSsaolAYcJ9N3FPGiWTIz7feUxwbhAkNc%3D%0A&s=c88a2b8458e7f07ba887d26414f2ce5b80f9180353d0895a7f4f8b47d19dbec8
> 
> When we have that thing in place we can probably even go a step further
> and simply disable error thresholding because it simply doesn't give us
> what we need.
> 
> Btw, we would very much welcome your thoughs on this collector thing and
> whether it would be usable in large installations. Or if not, what else
> would it need added.

Ooh I like this, I'll add some thoughts over in that thread.

> > That said, I ran mce-test with your patch on top of 3.18, looks good
> > there. But that doesn't simulate CMCI interrupts.
> 
> Thanks, I had a machine here which was the perfect natural error
> injector (has faulty DIMMs). I need to dig it out and play with it again
> :)
> 
> Thanks.
> 
> ---
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c 
> b/arch/x86/kernel/cpu/mcheck/mce.c
> index e3f426698164..8cf2f486b81e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1331,6 +1331,24 @@ int ce_error_seen(void)
>       return test_and_clear_bit(0, v);
>  }
>  
> +static void __restart_timer(struct timer_list *t, unsigned long interval)
> +{
> +     unsigned long when = jiffies + interval;
> +     unsigned long flags;
> +
> +     local_irq_save(flags);
> +
> +     if (timer_pending(t)) {
> +             if (time_before(when, t->expires))
> +                     mod_timer_pinned(t, when);
> +     } else {
> +             t->expires = round_jiffies(when);
> +             add_timer_on(t, smp_processor_id());
> +     }
> +
> +     local_irq_restore(flags);
> +}
> +
>  static void mce_timer_fn(unsigned long data)
>  {
>       struct timer_list *t = this_cpu_ptr(&mce_timer);
> @@ -1362,8 +1380,7 @@ static void mce_timer_fn(unsigned long data)
>  
>  done:
>       __this_cpu_write(mce_next_interval, iv);
> -     t->expires = jiffies + iv;
> -     add_timer_on(t, cpu);
> +     __restart_timer(t, iv);
>  }
>  
>  /*
> @@ -1372,16 +1389,10 @@ done:
>  void mce_timer_kick(unsigned long interval)
>  {
>       struct timer_list *t = this_cpu_ptr(&mce_timer);
> -     unsigned long when = jiffies + interval;
>       unsigned long iv = __this_cpu_read(mce_next_interval);
>  
> -     if (timer_pending(t)) {
> -             if (time_before(when, t->expires))
> -                     mod_timer_pinned(t, when);
> -     } else {
> -             t->expires = round_jiffies(when);
> -             add_timer_on(t, smp_processor_id());
> -     }
> +     __restart_timer(t, interval);
> +
>       if (interval < iv)
>               __this_cpu_write(mce_next_interval, interval);
>  }
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> Sent from a fat crate under my desk. Formatting is fine.
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to