On Wed, Jan 24, 2007 at 11:11:53PM +0000, Alan wrote: > > I am going to assume that you are being facaetious, because it would be the > > rarified pinnacle of > > supreme arrogance to suggest that a cosmic ray event is a more likely > > explanation than a bug in > > the kernel. > > A one off non repeatable error experienced by two people out of the > millions using it does fit the cosmic ray description quite well. That's > not to say there isn't a bug, but you don't have enough data to even > begin debugging it unless its rather more reproducable.
The soft error rate (cosmic rays, alpha decay, etc.) for modern memory at sea level is estimated to be somewhere around 1000 - 5000 FIT/Mbit[1]. FIT is Failures in Time - errors per billion hours of use. If you've got 1GB of memory, you've got 8000Mbits. So you'd expect 8M - 40M errors per billion hours on your machine. Or 8 to 40 errors per 1000 hours. That's about one single-bit error per week to one per day. Yes, that's a lot. Can it really be that high? Big supercomputer installations actually measure it in errors per day or hour. Most of these errors will go completely unnoticed because they happen in data structures that aren't revisited (stale cache, unused code, empty memory). The remainder will often look like random disk read or write errors or random application bugs/crashes. Sound familiar? That's why people buy ECC memory. Now if we say that 10% of of that 1GB of RAM (~100MB) is kernel code/data (not including page cache) and that, say, 1-10% of errors trigger BUG/WARN code, we'll see these bug messages once every 100 days to once every 1000 weeks (per GB per user). As for the relative error rate vs kernel bugs - there are no shortage of Linux boxes with trouble-free uptimes much longer than the 100 days above. So yes, if a user reports a bug that's attributable to a single bit memory error that's otherwise unreproduced and unexplained, it's totally reasonable to chalk it up to cosmic rays until some sort of pattern of reports emerges. As for your particular bug: Eeek! page_mapcount(page) went negative! (-1) page->flags = 14 page->count = 0 page->mapping = 00000000 This check occurs whenever the last mapping is removed from a page. It's a very heavily used piece of code. The check is there as sanity-checking from when this logic was introduced. If there were a new bug here that could be triggered by gcc or telnet, odds are very good that it would trigger for TONS of people. So more likely theories are: a) pointer scribble from something completely unrelated or b) cosmic rays. As the nearby data (flags, count, mapping) doesn't appear to be scribbled on, (a) looks less promising. [1] http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/