On 12/3/2010 1:27 AM, Patricia Shanahan wrote:
I'm currently hunting an intermittent bug found by the test
qa/src/com/sun/jini/test/impl/outrigger/matching/StressTestWithShutdown.td.
After a failure on Hudson, I modified the .td file to make it fail more often by
increasing the number of entries (10,000), readers (1,000), and writers (1,000).
The writers write entries in an OutriggerServerImpl JavaSpace. The readers read,
and then take, entries that the writers wrote. Sometimes, a reader fails to find
an entry a writer claims to have written, causing a timeout.
The outrigger implementation depends on the class FastList, which seems to use
the infamous Double-Checked Locking idiom
(http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html).
The good news is that any memory model related error in FastList, or the related
class EntryHolder, would be a plausible cause of the observed symptom. The bad
news is that FastList and EntryHolder seem to have been written to be very
aggressively parallel, possibly by someone who was only familiar with
sequentially consistent memory. :-(
The important issue in FastList is that it was written against the JDK 1.4
memory model. After moving River to Java 1.5, we'd have the JSR 166 concurrency
utilities and the new, consistent memory model (JSR 133), where volatile has a
true meaning. However, as you have noted, this code in particular is quite
complex, so even adjusting it to the new memory model could be problematic.
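
As a hedged illustration of what the Java 5 memory model changes, here is a
minimal sketch of double-checked locking that relies on a volatile field.
LazyResource and all of its members are hypothetical names invented for this
example; this is not code from FastList or EntryHolder, and it is not a claim
about how those classes should be fixed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; not taken from FastList or EntryHolder.
final class LazyResource {
    // Counts constructions so single initialization can be checked.
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    // 'volatile' is what makes the idiom safe under the Java 5 (JSR 133)
    // memory model: the write that publishes the instance happens-before
    // any later read that observes it.
    private static volatile LazyResource instance;

    private final int payload;

    private LazyResource() {
        payload = 42;
        CONSTRUCTED.incrementAndGet();
    }

    static LazyResource getInstance() {
        LazyResource local = instance;          // first check, no lock taken
        if (local == null) {
            synchronized (LazyResource.class) {
                local = instance;               // second check, under the lock
                if (local == null) {
                    local = new LazyResource();
                    instance = local;           // volatile write publishes it
                }
            }
        }
        return local;
    }

    int payload() {
        return payload;
    }
}
```

Without volatile on the field, a reader could observe a non-null reference to
an object whose fields are not yet visible, which is the failure mode the
Double Checked Locking page describes. Under the JDK 1.4 model, volatile did
not give this guarantee.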
Many people are using Dan Creswell's Blitz JavaSpaces implementation or
commercial versions. I'm partially inclined to suggest that we should discuss
EOL of outrigger at some point. Even though JavaSpaces is a large part of what
Jini has been recognized for, it has a focused audience and if we don't have
someone with knowledge and interest to support outrigger, it may be more of a
wart than River can deal with.
Usually, it is easy to fix a problem once it has been located. This may be a bit
more difficult, especially because I assume the parallelism is needed for
acceptable JavaSpace performance.
One of the issues that I've found in network-intensive applications is that the
latency of communications is so huge compared to the time spent in code paths
that all active threads fairly quickly end up hovering on top of any use of
"synchronized", so there is always worst-case contention for such protected
resources. It's important to understand how to deal with this by either
minimizing synchronization time or avoiding funneling kinds of locking mechanisms.
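
To make those two approaches concrete, here is a small sketch under stated
assumptions. EntryRegistry, prepare, and the other names are hypothetical and
unrelated to outrigger's actual classes. The add method keeps the slow work
outside the synchronized block so the lock is held only for the list update,
and the per-type counts live in a ConcurrentHashMap so that threads do not all
funnel through one lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry; names are illustrative and unrelated to outrigger.
final class EntryRegistry {
    private final Object lock = new Object();
    private final List<String> entries = new ArrayList<String>();

    // Per-type counts with no single lock that every thread must pass through.
    private final ConcurrentMap<String, Integer> countsByType =
            new ConcurrentHashMap<String, Integer>();

    // Stand-in for slow work (serialization, network I/O) that should not
    // run while the lock is held.
    private static String prepare(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    void add(String type, String raw) {
        String prepared = prepare(raw);     // slow part, outside the lock
        synchronized (lock) {
            entries.add(prepared);          // short critical section
        }
        countsByType.merge(type, 1, Integer::sum);  // atomic per-key update
    }

    int size() {
        synchronized (lock) {
            return entries.size();
        }
    }

    int count(String type) {
        return countsByType.getOrDefault(type, 0);
    }
}
```

Whether either change helps in a given server depends on where the contention
really is, so measuring before and after is still needed.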
Gregg Wonderly