Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Tim Hockin

On 7/12/07, Andi Kleen <[EMAIL PROTECTED]> wrote:


> -- there may be other edge cases other than
> this one. I'm actually surprised that this wasn't a ring buffer to start
> with -- it certainly seems like it wanted to be one.

The problem with a ring buffer is that it would lose old entries; but
for machine checks you really want the first entries because
the later ones might be just junk.


Couldn't the ring just have logic to detect an overrun and stop
logging until that is alleviated?  Similar to what is done now.  Maybe
I am underestimating it..
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Andi Kleen
On Thursday 12 July 2007 22:24:36 Joshua Wise wrote:
> Some time after my first[1] patch to mce.c, I had continued testing using
> the test method described there (inject hundreds of thousands of single-bit
> errors while reading from /dev/mcelog), and I came across a strange issue. 
> Once every couple test cycles (i.e., every hundred-thousand injects), the
> system would begin behaving very strangely. The serial port would stop
> responding entirely, and my SSH sessions would respond, but to one keystroke
> behind what I had typed. After two minutes, the watchdog timer would reboot
> the system.

Thanks for the analysis.

I guess at some point it would make sense to model the whole thing in spin or 
similar and find even the last race. Anyone interested? @)

> 
> At this point, mce_read() and mce_log() are blocking on each other, and will
> be for all eternity. mce_log() is waiting for the on_each_cpu() to complete
> so that the loop can complete, and on_each_cpu() (from mce_read()) is
> waiting for mce_log() to finish.

I suspect the right fix is to get rid of collect_tscs(). 

One possible way would be to get the TSCs from the mce_log table
before the synchronization instead of asking the CPUs.

Or switch it over to the NMI supporting RCU that was recently posted.


> -- there may be other edge cases other than 
> this one. I'm actually surprised that this wasn't a ring buffer to start
> with -- it certainly seems like it wanted to be one. 

The problem with a ring buffer is that it would lose old entries; but 
for machine checks you really want the first entries because
the later ones might be just junk.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Joshua Wise
Some time after my first[1] patch to mce.c, I had continued testing using
the test method described there (inject hundreds of thousands of single-bit
errors while reading from /dev/mcelog), and I came across a strange issue. 
Once every couple test cycles (i.e., every hundred-thousand injects), the
system would begin behaving very strangely. The serial port would stop
responding entirely, and my SSH sessions would respond, but to one keystroke
behind what I had typed. After two minutes, the watchdog timer would reboot
the system.

These seemed to be fairly classic cases of missing interrupts, so in one of
my remaining sessions, I did a 'cat /proc/interrupts' after the machine had
entered this state. Sure enough, interrupt counts on CPU0 were not
incrementing, although my network interrupts on CPU1 were.

After some diagnostics, it appears that there is a race condition between
mce_log() and mce_read() that can lead to cpu0 spinning forever in mce_log's
loop that tries to claim a next entry.

The race requires a large number of machine checks to be occurring in order
for it to happen. Consider the case in which we take a machine check that
was, as the comment in mce.c:518 (mce_read) might describe, was still
getting written before the synchronize. We are presently in
synchronize_sched. The system, at this time, looks like this:

mcelog.next: 0
mcelog.entry: {0} {0} {1} {0} {0} {0} {0} {0}

(For the sake of simplicity, a mcelog entry in my little world here has only
one property -- finished. 1 means that it is finished, and 0 means that it
isn't.)

In the normal case, the rest would get cleaned up by the subsequent loop,
looping from i = next to MCE_LOG_LEN. But, what happens if we take another
three machine checks while we're still in synchronize_sched()?
synchronize_sched() isn't just waiting for all mce_log()s to finish, but in
fact is waiting for all CPUs to be done, which could take up to a tick -- or
in the case of our very slow interrupt handler, longer. So, although on a
normally functioning system, this is unlikely, it is possible that we might
take another three machine checks in that span of time.

So, let's watch what happens. After the first machine check, we get:

mcelog.next: 1
mcelog.entry: {1} {0} {1} {0} {0} {0} {0} {0}

And after the second one, we get:

mcelog.next: 2
mcelog.entry: {1} {1} {1} {0} {0} {0} {0} {0}

But what happens on the third one? The third one will loop waiting for the
finished bit on the third entry (mcelog.entry[2]) to be cleared, since that
is what the loop under the synchronize_sched() is supposed to do. So,
normally, it'd be a sane assumption to say that eventually that bit would
get cleared.

But, in this case, we're looping forever in an interrupt. Even though the
synchronize_sched() will have finished, the on_each_cpu() will be looping
forever as well, waiting for the CPU that's processing the exception to get
around to running the collect_tscs() routine.

At this point, mce_read() and mce_log() are blocking on each other, and will
be for all eternity. mce_log() is waiting for the on_each_cpu() to complete
so that the loop can complete, and on_each_cpu() (from mce_read()) is
waiting for mce_log() to finish.

We've researched this recently, and we've come up with a few possible
solutions for this.

The first option is to make mce_log() time out if it has looped too many
times in the outer loop. I've instrumented this and found that it's
possible for the loop to run over ten million times in normal operation, but
it almost never goes above that. The net result, though, is that if we go
above that magic number, we should probably bail out, since it's unlikely
that we'll ever come back to life.

I don't really like this solution, though. It's obviously a nasty hack, and
doesn't solve the underlying condition. And, on top of that, we lose MCEs
when it breaks. Now, if we're under this much MCE load, we probably won't
miss them, but still, we lose one.

The other option is to give mce_log and mce_read a complete rewrite, it
seems. The code looks like it has been through quite a few iterations to fix
edge cases just like this one -- there may be other edge cases other than
this one. I'm actually surprised that this wasn't a ring buffer to start
with -- it certainly seems like it wanted to be one. The obvious
disadvantage to that is that new code is just as likely to be broken in
subtle ways as the old code originally was. It may end up cleaner in the
long run, but in the short run, it's a new codebase with different
properties than what we're used to.

I'm not quite sure what to do here. I'm leaning towards the mce_log timeout
method, just fixing this symptom for the time being. I don't like it much,
but in the absence of an option to not be evil[2], I guess we should go with
the least evil option possible.

I've come up with a patch that does this (inlined and attached), and I've
been banging on a system for the past 12 hours trying to break it like this.
It seems to b