Richard Fish wrote:
On 11/30/06, Vladimir G. Ivanovic <[EMAIL PROTECTED]> wrote:
I have done nothing to my hardware and I've seen this error, oh, a
half a dozen times, the last time 3 months (?) ago. I ran memtest when
I installed new memory, and it did not report problems even when run
for hours.

memtest is basically useless these days.  It can only tell you if you
have a bad memory cell, which almost never happens today.  Most memory
problems are the result of timing issues between the processor(s) and
DMA controllers.

This script [1] seems to be a much better memory test for modern
systems, although you may have to make some tweaks to run it on
Gentoo.

Just for kicks I'll run the script and see what happens.


And I do not get random segfaults with other programs.

Yes, compiling is unique in this regard.  The memory access
pattern of a compiler, reading and writing to locations on different
rows, or even different modules, under high CPU load and using lots of
memory, with some IO thrown in for good measure, tends to reveal
hardware problems quite nicely.

Finally, I don't think my hardware fixed itself.

Given all of this, my suspicion is that these errors are software
bugs, not hardware problems.

For grins, here is part of comment #174:

        Random segfaults during compilation. ... in general a sign of
        hardware problems.

        // No, this is in general a sign of GCC 4.1 - problem ;-)

If we were talking about a driver, or an event-based GUI program, I
might agree.  But a compiler is going to take the exact same actions
given the same input and options.  The compiler isn't going to do
something different between 2 different executions over the _exact_
same sources because it feels like it.

You're right at the logical level, but not at the physical level. Cache effects and different disk accesses are two physical differences that spring to mind. Temporary files will be in different physical sectors, or in the buffer cache or not; directories may or may not be in the directory cache. Depending on what else is running, the pattern of cache misses will be different.

I emerge with -j2. Plus I'm doing work while the emerges happen. The likelihood of the memory access pattern of two compiles being the same is precisely zero.
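
The logical-level claim above (same input, same result) is something one can check directly. Below is a minimal sketch in Python, assuming the command under test writes its result to stdout. The helper names output_digests and is_deterministic are made up for illustration and are not part of gcc or portage.

```python
import hashlib
import subprocess


def output_digests(cmd, runs=3):
    """Run cmd `runs` times; return the set of SHA-256 digests of its stdout."""
    digests = set()
    for _ in range(runs):
        # check=True raises CalledProcessError if the command itself fails
        result = subprocess.run(cmd, capture_output=True, check=True)
        digests.add(hashlib.sha256(result.stdout).hexdigest())
    return digests


def is_deterministic(cmd, runs=3):
    """True if every run produced byte-identical stdout."""
    return len(output_digests(cmd, runs)) == 1
```

With a compiler one would pass something like ["gcc", "-S", "-o", "-", "foo.c"] (foo.c being a placeholder file name) so that the generated assembly goes to stdout. Identical digests only show that the logical result is the same. They say nothing about the physical access pattern, which is the point made above.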



The other thing that I don't really believe is the part about "this
bug not being reproducible" as reported by portage/emerge/make/gcc.

Then you should read the gcc sources.  One of the patches applied by
Gentoo adds a retry loop when the compiler is about to exit with an
internal compiler error (ICE).  It retries the compile twice, and if
either of those succeeds, you get the "The bug is not reproducible"
message.
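
As a rough illustration of the retry behaviour just described, the logic might look something like the Python sketch below. This is not the Gentoo patch itself, only a sketch based on the description above. The names run_with_ice_retry and is_ice are hypothetical, and how a real ICE is detected is left to the caller.

```python
import subprocess

RETRIES = 2  # per the description above: the compile is retried twice


def run_with_ice_retry(cmd, is_ice, retries=RETRIES):
    """Run cmd; if it fails in an ICE-like way, rerun it up to `retries` times.

    is_ice is a caller-supplied predicate on the CompletedProcess
    (hypothetical; the real check lives inside the compiler driver).

    Returns (exit_code, reproducible):
      reproducible is None  -> no ICE on the first run
      reproducible is False -> a retry succeeded ("bug is not reproducible")
      reproducible is True  -> every retry failed as well
    """
    first = subprocess.run(cmd, capture_output=True, text=True)
    if not is_ice(first):
        return first.returncode, None
    for _ in range(retries):
        again = subprocess.run(cmd, capture_output=True, text=True)
        if again.returncode == 0:
            # Still report the original failure, as the thread describes.
            return first.returncode, False
    return first.returncode, True
```

Note that the sketch returns the exit code of the first, failed run even when a retry succeeds, which matches the behaviour described here: the build still stops with an error.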

Interesting. I did not know that. But I don't get why gcc exits with an error when the second (or third) try succeeds. Why not just print a warning, perhaps at the end so it is noticeable? Most people will restart the entire emerge, which seems like a gargantuan amount of wasted effort since the re-compilation has succeeded.

It doesn't output anything because that would possibly
obscure the original error.

The Gentoo devs probably added this loop to avoid more duplicates of [2].

-Richard

[1] http://people.redhat.com/dledford/memtest.html
[2] http://bugs.gentoo.org/show_bug.cgi?id=20600

--
gentoo-user@gentoo.org mailing list
