On 12/3/2010 1:27 AM, Patricia Shanahan wrote:
I'm currently hunting an intermittent bug found by the test
qa/src/com/sun/jini/test/impl/outrigger/matching/StressTestWithShutdown.td

After a failure on Hudson, I modified the .td file to make it fail more often by
increasing the number of entries (10,000), readers (1000), and writers (1000).

The writers write entries in an OutriggerServerImpl JavaSpace. The readers read,
and then take, entries that the writers wrote. Sometimes, a reader fails to find
an entry a writer claims to have written, causing a timeout.

The outrigger implementation depends on the class FastList which seems to use
the infamous Double Checked Locking idiom
(http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html)

The good news is that any memory model related error in FastList, or the related
class EntryHolder, would be a plausible cause of the observed symptom. The bad
news is that FastList and EntryHolder seem to have been written to be very
aggressively parallel, possibly by someone who was only familiar with
sequentially consistent memory. :-(

The important issue in FastList is that it was written against the JDK 1.4 memory model. After moving River to Java 1.5, we'd have the JSR-166 concurrency work and the new, consistent memory model (JSR-133), where volatile has a well-defined meaning. However, this code in particular is quite complex, as you have noted, so even adjusting it to the new memory model could be problematic.
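
To make the volatile point concrete, here is a minimal sketch of double-checked locking as it can be written under the Java 5 memory model. It is not code from FastList or EntryHolder; the names LazyHolder and Resource are hypothetical and exist only for illustration. Without the volatile modifier on the field, a second thread could see a non-null reference to a Resource whose constructor writes are not yet visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from outrigger.
class LazyHolder {
    // Hypothetical resource type that counts how many instances were built.
    static final class Resource {
        static final AtomicInteger created = new AtomicInteger();
        final int id;

        Resource() {
            id = created.incrementAndGet();
        }
    }

    // volatile gives the write below a happens-before edge to later reads,
    // which is what the idiom lacked under the older memory model.
    private volatile Resource resource;

    Resource get() {
        Resource r = resource;          // single volatile read on the fast path
        if (r == null) {
            synchronized (this) {
                r = resource;           // re-check while holding the lock
                if (r == null) {
                    r = new Resource();
                    resource = r;       // safe publication via volatile write
                }
            }
        }
        return r;
    }
}
```

Whether FastList can be repaired this simply is a separate question; the sketch only shows what the corrected idiom looks like in isolation.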

Many people are using Dan Creswell's Blitz JavaSpaces implementation or commercial versions. I'm partially inclined to suggest that we should discuss end-of-life (EOL) for outrigger at some point. Even though JavaSpaces is a large part of what Jini has been recognized for, it has a focused audience, and if we don't have someone with the knowledge and interest to support outrigger, it may be more of a wart than River can deal with.

Usually, it is easy to fix a problem once it has been located. This may be a bit
more difficult, especially because I assume the parallelism is needed for
acceptable JavaSpace performance.

One of the issues I've found in network-intensive applications is that communication latency is so huge compared to the code paths that all active threads fairly quickly end up hovering on top of any use of "synchronized", so there is always worst-case contention for the protected resources.

It's important to understand how to deal with this, either by minimizing the time spent holding a lock or by avoiding locking mechanisms that funnel every thread through a single point.
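
As a rough sketch of the first option, the slow work can be done before taking the lock, so the lock is held only for the shared update. Everything here is hypothetical: slowFetch stands in for a remote call, and the class is not part of outrigger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from outrigger.
class ShortCriticalSection {
    private final Object lock = new Object();
    private final List<String> results = new ArrayList<String>();

    // Hypothetical stand-in for a slow network operation.
    static String slowFetch(int id) {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "entry-" + id;
    }

    void fetchAndRecord(int id) {
        // The slow part runs outside the lock, so threads overlap here.
        String value = slowFetch(id);
        // The lock covers only the brief update of shared state.
        synchronized (lock) {
            results.add(value);
        }
    }

    int size() {
        synchronized (lock) {
            return results.size();
        }
    }
}
```

Had slowFetch been called inside the synchronized block, every thread would queue behind the one currently waiting on the network, which is the funneling effect described above.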

Gregg Wonderly
