On Tue, 01 Jul 2014 17:23:14 +0200,
Willem Jan Withagen <w...@digiware.nl> wrote:

> On 2014-07-01 16:48, Rang, Anton wrote:
> > DOT => DOD
> >
> > 444F54 => 444F44
> >
> > That's a single-bit flip.  Bad memory, perhaps?
> 
> Very likely, especially if the system does not have ECC....
> It just happens on rare occasions that an alpha particle, a power cycle, or
> anything else disruptive damages a memory cell. And it could be that
> it requires a special pattern of accesses to actually exhibit the error.
> 
> In the past (199x's) 'make buildworld' used to be a rather good memory 
> tester. But nowadays look at
>       http://www.memtest.org/
> 
> This tool has found all of the bad memory in all the systems I have used
> or built for others...
> Note that it might take a few runs and some more heat to actually
> trigger the faulty cell, but memtest86 will usually find it.
> 
> Note that on big systems with lots of memory it can take a loooooong 
> time to run just one full testset to completion.
> 
> --WjW
> 
> 
> >
> > Anton
> >
> > -----Original Message-----
> > From: owner-freebsd-curr...@freebsd.org 
> > [mailto:owner-freebsd-curr...@freebsd.org] On
> > Behalf Of O. Hartmann Sent: Tuesday, July 01, 2014 8:08 AM
> > To: Dimitry Andric
> > Cc: Adrian Chadd; FreeBSD CURRENT
> > Subject: Re: [CURRENT]: weird memory/linker problem?
> >
> > On Mon, 23 Jun 2014 17:22:25 +0200,
> > Dimitry Andric <d...@freebsd.org> wrote:
> >
> >> On 23 Jun 2014, at 16:31, O. Hartmann <ohart...@zedat.fu-berlin.de> wrote:
> >>> On Sun, 22 Jun 2014 10:10:04 -0700,
> >>> Adrian Chadd <adr...@freebsd.org> wrote:
> >>>> When they segfault, where do they segfault?
> >> ...
> >>> GIMP, LaTeX work, nothing special, but a bit memory consuming
> >>> regarding GIMP) I tried updating the ports tree and surprisingly the
> >>> tree was left in an unclean condition when /usr/bin/svn segfaulted
> >>> (on console: pid 18013 (svn), uid 0: exited on signal 11 (core dumped)).
> >>>
> >>> Using /usr/local/bin/svn, which is from the devel/subversion port,
> >>> performs well, while FreeBSD 11's svn contribution dies as described.
> >>> It did not hours ago!
> >>
> >> I think what Adrian meant was: can you run svn (or another crashing
> >> program) in gdb, and post a backtrace?  Or maybe run ktrace, and see
> >> where it dies?
> >>
> >> Alternatively, put a core dump and the executable (with debug info) in
> >> a tarball, and upload it somewhere, so somebody else can analyze it.
> >>
> >> -Dimitry
> >>
> >
> > It's me again, with the same weird story.
> >
> > After a couple of days silence, the mysterious entity in my computer is
> > back. This time it is again a weird compiler message of failure (trying
> > to buildworld):
> >
> > [...]
> > c++  -O2 -pipe -O3 -O3
> > c++ -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/tools/clang/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support -I.
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/../../lib/clang/include
> > -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS 
> > -D__STDC_CONSTANT_MACROS
> > -fno-strict-aliasing 
> > -DLLVM_DEFAULT_TARGET_TRIPLE=\"x86_64-unknown-freebsd11.0\"
> > -DLLVM_HOST_TRIPLE=\"x86_64-unknown-freebsd11.0\" -DDEFAULT_SYSROOT=\"\"
> > -Qunused-arguments -I/usr/obj/usr/src/tmp/legacy/usr/include -std=c++11
> > -fno-exceptions -fno-rtti -Wno-c++11-extensions
> > -c /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/Host.cpp
> > -o Host.o
> > --- GraphWriter.o ---
> > In file included from /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/GraphWriter.cpp:14:
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:269:10: error: use of undeclared identifier 'DOD'; did you mean 'DOT'?
> >     O << DOD::EscapeString(Label);
> >          ^~~
> >          DOT
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:35:11: note: 'DOT' declared here
> > namespace DOT {  // Private functions...
> >           ^
> > 1 error generated.
> > *** [GraphWriter.o] Error code 1
> >
> >
> > Well, in the past I saw many of those messages, especially labels of
> > routines not found in shared objects/libraries, or even those "funny"
> > misspelled messages shown above.
> >
> > I cannot reproduce them after a reboot, but once the error has occurred
> > it is sticky for as long as the system keeps running. So in order to
> > compile the OS successfully, I reboot.
> >
> > Does anyone have an idea what this could be? Since it affects only one
> > machine at the moment (the other CoreDuo has been retired in the
> > meantime), it feels a bit like a miscompilation on a certain type of CPU.
> >
> > Thanks for your patience,
> >
> > Oliver


Hello all.

Well, I'd like to give an update. It doesn't resolve the particular problem,
but it might add to the pool of experience.

The box in question now runs with only 4 GB and is operable as expected. With
8 GB, I see those reported weird bugs, and they have indeed revealed
themselves as bit flips. I cannot reproduce them on demand; they occur
spontaneously, but I can raise their frequency by permuting the RAM sticks.
That is the state so far. As reported, memtest86+ doesn't show anything, even
after three days(!) of testing!
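
For anyone who wants to check a suspicious "misspelling" like the DOT/DOD one
quoted above, here is a minimal sketch in Python. The function name
flipped_bits is made up for this illustration; it only XORs the two byte
strings and lists the positions of the differing bits.

```python
def flipped_bits(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return (byte_index, bit_index) for every bit that differs between a and b.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    diffs = []
    for i, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # a set bit in delta marks a differing bit
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((i, bit))
    return diffs


# 'T' is 0x54 and 'D' is 0x44; they differ only in bit 4 (value 0x10).
print(flipped_bits(b"DOT", b"DOD"))  # prints [(2, 4)]
```

A result with exactly one entry means the two strings differ by a single bit,
which fits a memory fault far better than a compiler or linker bug.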

The box was built in 2009 as a development system with 4 GB RAM. At that time,
the developer ordered special and expensive overclocker RAM, Ballistix, from
Crucial. Usually, I purchase JEDEC-conformant RAM - I have an allergic
reaction to this stupid overclocking and "planned destruction with fun" of
silicon by overdriving it, especially when it concerns equipment we have to
rely on. The box was then upgraded with a further 4 GB of RAM (two sticks) of
the same type and brand, running at 2+ volts (as far as I know).

Last summer, after 4 years of problem-free operation, I suddenly had to fight
spontaneous crashes and blamed FreeBSD CURRENT, but very quickly the memory
turned out to be the culprit. The funny thing was: the box literally "roasted"
the upper 4 GB bank, and the sticks got so hot that you could seriously burn
your fingers touching them (I did!). The end of that game was, after a cascade
of tests and swapping of RAM sticks, that the sticks in the upper slots (B1
and B2) were destroyed! After I replaced the RAM completely with 8 GB of
JEDEC-conformant memory, the system ran perfectly, until this summer. When the
temperatures in my lab went beyond 20 degrees Celsius at the end of May, the
box started having these bit-flip issues.

I guess that there is a temperature-triggered problem with the voltage
regulation, or something else that is slowly killing the RAM modules/sticks.
This is only a guess. As I reported, the chipset itself reports 81 - 85
degrees C (in the BIOS and with healthd). This high temperature appeared
suddenly last year, and I first thought it could be a mismeasurement.

After testing VBox and occupying all available memory without any obvious
error or crash, I tried compiling the OS, and it seems that the notable load
LLVM/CLANG produces when building world/kernel in parallel also triggers this
bit flip, which very quickly results in strange errors as reported earlier in
this thread. The ultimate failure arose when I tried to install Windows 7 on a
free hard drive with 8 GB installed: the install process died with a file
corruption or a file that was not copied. I didn't dare to try the FreeBSD
installation, since I know from the past that FreeBSD's copying also very
quickly triggers any hardware issues that are present (overheating and its
siblings). With only 4 GB everything works as expected, but 4 GB is a pain in
the ass with ZFS and 11.0-CURRENT alone, not to mention the pain when doing
memory-intensive calculations/simulations or even running VBox.

In the end, there is a mixed conclusion. I realise that I cannot trust the
verdict of memtest86+. There is no suitable "burn-in" test for FreeBSD that
consumes, stresses and tortures memory and bus systems as well as all cores of
the CPU, starting with Core2Duo CPUs, since cpuburn/burncpu from the ports do
not utilise AVX/SIMD or other "hot" facilities of modern Intel-like CPUs, nor
do they stress the integrated memory controller in a "brutal" way. Prime95 is
only available for i386 - and that is a pity on amd64 with > 4 GB RAM.
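
Just to illustrate the kind of test I mean, here is a very naive userland
sketch in Python: fill a buffer with a byte pattern, read it back and count
mismatches. The names fill, mismatches and run are invented for this example.
It is not a replacement for memtest86+ or a real burn-in tool: it only sees
virtual memory handed out by the kernel, the CPU caches can hide faulty
cells, and it creates neither heat nor serious bus load.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def fill(buf: bytearray, pattern: int) -> None:
    """Overwrite every byte of buf with the given byte pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def mismatches(buf: bytearray, pattern: int) -> list[int]:
    """Return the offsets in buf whose byte differs from the pattern."""
    if buf == bytes([pattern]) * len(buf):
        return []  # fast path: nothing differs
    return [i for i, value in enumerate(buf) if value != pattern]


def run(size_bytes: int, passes: int = 1) -> int:
    """Write and verify each pattern `passes` times; return the error count."""
    buf = bytearray(size_bytes)
    errors = 0
    for _ in range(passes):
        for pattern in PATTERNS:
            fill(buf, pattern)
            errors += len(mismatches(buf, pattern))
    return errors


# Small demonstration run over 1 MiB; real use would need far more memory,
# many passes and parallel load on all cores.
print(run(1 << 20))
```

Several such processes running in parallel next to a buildworld would at
least combine memory traffic with CPU load, which is what seems to trigger
the flips here.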

In the end, there is no reason to purchase a "workstation-grade" mainboard
again, as advertised by ASUS, for instance, with this overclocking crap. It
leaves a very bitter taste - in my personal view. Since the memory problems I
observed do not show up as permanent, "steady-state" problems, I fear data
corruption that no protection mechanism indicates - so for the future, ECC is
more or less a must. And this means, even for "low end" workstations, bye-bye
cheap crappy Intel toy CPUs! At least a XEON-type, ECC-capable processor is a
prerequisite, and I wish AMD had not followed the cheap man's path by ripping
the ECC facilities out of their consumer CPUs. It is a matter of fact that
even in the academic environment "cheap" ECC-less systems are purchased for
"cost effectiveness".

Finally, I personally wish for some massive torture tools like cpuburn, or
something more sophisticated, to stress the CPU to its limit - to test the
reliability, the cooling facilities and the power supply (is it flaky under
heavy load, etc.?). FreeBSD's ports do not even have the simplest Prime95 in a
64-bit version, as is available for Linux or Windows. I'm sure some
professionals are capable of pulling together some massive stress-test tools,
but could these please be made available to the not-so-professional, more
"common" users? Maybe a naive Christmas wish?

I need to replace the system, since I cannot rely on that flaky box anymore,
even when using encrypted devices. That is, after a painful time and many
hopes, the final conclusion.

Regards, and thanks for your patience in reading this far,
Oliver
