We might be able to get some benefit more portably by writing a Java
program that spawns a specified number of threads, all of which
continuously increment the same AtomicLong. That will cause both CPU
load and, with multiple processor chips, bus activity. Run it, along
with River tests, specifying different numbers of processors.
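A minimal sketch of such a load generator (the class and method names are my own invention, and the default run time is arbitrary):

```java
import java.util.concurrent.atomic.AtomicLong;

// Portable load generator: every thread hammers the same AtomicLong,
// producing CPU load and, on multi-chip machines, bus/cache-line traffic.
public class ContentionLoad {
    static final AtomicLong counter = new AtomicLong();

    // Spawn nThreads that increment the shared counter for the given duration,
    // then stop them and return the total count.
    static long runLoad(int nThreads, long millis) throws InterruptedException {
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        counter.incrementAndGet(); // contended CAS on one cache line
                    }
                }
            });
            workers[i].setDaemon(true);
            workers[i].start();
        }
        Thread.sleep(millis);
        for (Thread t : workers) t.interrupt();
        for (Thread t : workers) t.join();
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        System.out.println("increments: " + runLoad(nThreads, 60000));
    }
}
```

Run it in the background while the test suite executes, varying the thread count to shift timings around.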
Patricia
On 12/17/2013 8:07 PM, Peter wrote:
Thanks Patricia,
Good to have someone with experience who can explain why.
Your parstore program is interesting.
Perhaps I should obtain a second-hand 8-core, 32-way T2000; I'll investigate
this option in the new year.
Despite fixing synchronization errors, I'm not seeing contention in thread
dumps during any tests; the standard tests probably aren't loading the
system sufficiently.
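As a cross-check against reading thread dumps by hand, contention can also be inspected programmatically; a sketch using the standard JDK ThreadMXBean API (the class name is my own):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Print how often each live thread has blocked on a monitor, a rough
// programmatic stand-in for eyeballing thread dumps for contention.
public class ContentionCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // false, false: skip the (expensive) locked-monitor/synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            System.out.printf("%-30s blocked %d times%n",
                    info.getThreadName(), info.getBlockedCount());
        }
    }
}
```

If every thread shows a blocked count near zero while the tests run, that supports the theory that the standard tests aren't generating real contention.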
Regards,
Peter.
----- Original message -----
I don't think you need to assume increased JVM optimization to explain
the phenomenon. It is typical of what happens when increasing thread
safety in a system that has been subject to testing, but not full thread
safety design and code review.
Testing, and fixing test failures, tends to eliminate only bugs that
have high probability in the current environment of sequences of events
and their relative timings. Fixing a bug often shifts the probabilities:
a sequence that brings out another bug may have been low probability while
the now-fixed bug was present, but becomes high probability once it is gone.
It may be worth trying a variation of a program, parstore, I wrote at
Sun Microsystems, but you can only do this sort of thing on a dedicated
system. Parstore spawned a specified number of threads, each of which
did nothing but the fastest form of store supported on Sun SPARC
systems, a dump of the floating point registers to memory, in a
continuous loop. By filling various queues, keeping buses busy, and
generally shifting timings around, it brought out several new test
failures.
Patricia
On 12/17/2013 10:48 AM, Peter wrote:
What I've found during the refactoring process is that fixing one
data race bug will expose another.
The latest test failure, seen on multiple architectures, has never
occurred before during testing (it's the only test failure on Ubuntu x64
and ARM JDK 7). Here a boolean field is written, while synchronized, by
an event-listening thread; the field value is then read unsynchronized
by the test thread twice in short succession. The first read should have
caused the test to pass, but it experiences a data race. The second read
sees the updated value and causes a test failure. Without the race
condition the test thread wouldn't arrive at this point.
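Reduced to a toy example, the shape of that race looks something like this (the class, field, and method names are invented for illustration):

```java
// The listener thread writes the flag while holding the lock, but the test
// thread reads it without any synchronization, so under the Java Memory
// Model the two unsynchronized reads may legally see different values.
public class RacyFlag {
    private boolean done; // not volatile: unsynchronized reads may be stale

    public synchronized void markDone() { // writer holds the lock
        done = true;
    }

    public boolean isDoneUnsafe() { // reader takes no lock: this is the race
        return done;
    }

    public synchronized boolean isDoneSafe() { // fix: read under the same lock
        return done;
    }
}
```

Making the field volatile, or reading it under the same lock the writer uses, removes the race; which fix is right depends on what other state the lock guards.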
I don't know the internal workings of the JVM, but it seems like the
more correct the code is, the more aggressively the JVM optimises.
So now that I've fixed so many other issues, these race conditions are
starting to fail every time, rather than occasionally.