Re: advice & review requested for 6896297, race condition in rmid causing JCK failure

Stuart Marks Tue, 12 Apr 2011 14:42:35 -0700

Hi David,

Thanks for your notes and analysis. Good point about the temporaryinconsistency between groupEntry and groupTable. With the change, there's abrief moment where a GroupEntry in the registered state can be observed to beabsent from the groupTable, which wasn't possible before. I don't think this isa problem though. I can't find any places where operations on a GroupEntry needto look in the groupTable. (If they did exist, they'd have run into the samedeadlock that I ran into in making this change.) Most paths through the codeseem to first fetch an entry from groupTable and then operate on the entry.Somebody could remove and unregister the entry from groupTable between thefetch and the operation on the entry, but that can already occur in absence ofthis change.

In general the style of this code seems to be to lock individual datastructures piecemeal, so I wouldn't be surprised if there are already caseswhere they can be inconsistent temporarily. For example, it seems like thereought to be some relationship between the members of idTable and groupTable,yet they are locked and updated independently through quite different codepaths. This is dangerously close to a line of reasoning that goes like "it'salready inconsistent, so making it a little more inconsistent doesn't hurt"which I detest. However, I don't have any better ideas at the moment.

Also note that switching from HM to CHM (regardless of what we do with theserialized form) has the same issue. Doing this will protect the mapsthemselves -- and avoid ConcurrentModificationException, the original cause ofthis problem -- but it won't preserve any consistency between the maps.

Thanks for mentioning startupLock. It seemed suspicious to me too, but I thinkaddressing those issues is out of scope for this bugfix.


I'll talk about serialization more in my forthcoming reply to Alan.

s'marks

On 4/7/11 4:19 PM, David Holmes wrote:

Hi Stuart,

I can't answer your specific questions as I'm not familiar with the RMI
infrastructure either. Taking the lock in writeObject and moving
group.unregister out of the sync block to avoid deadlock seems reasonable. The
main question to ask with such a move is whether the temporary inconsistency in
state is something that can be observed and cause a problem.

The lock ordering analysis seems reasonable.

I do note that the startupLock protocol seems suspicious as there is a race
with detection of the lock value. I assume init itself can never be invoked by
more than one thread ;-) It may be that the object can only become accessible
concurrently at some time during init (ie it gets published) in which case the
protocol will work fine - but you really need a full understanding of the
lifecycle of the objects to determine that.

That aside, using ConcurrentHashMap would simplify the solution to the current
problem. If you are concerned about the serialized form then you could define
writeObject and readObject such that they convert between CHM and plain HM
using the serialized-fields API.

Cheers,
David

Stuart Marks said the following on 04/08/11 08:19:

Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential
fix I'm cooking for it. Here is the bug report; it contains details about the
problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap
field of an Activation object. The code mostly locks groupTable around any
access to it. However, no such locking is done during serialization. If the
groupTable is modified while it's being serialized,
ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to
remove the external locking entirely. Unfortunately this will change the
serialized representation of the Activation object, which I'm not sure is
acceptable.

Assuming that we can't change the serialized represenation, the alternative
approach would be to make sure that locking is done properly during
serialization. This is fairly easy to do by locking groupTable in a
writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths
through the code that take locks in the opposite order. Specifically, the
addLogRecord() method locks the log object and then (via serialization and
the newly added writeObject() method) locks groupTable. However, the
unregisterGroup() method locks groupTable and calls
GroupEntry.unregisterGroup() which logs the event, which takes a lock on the
log.

After some analysis, I've determined that the call to
GroupEntry.unregisterGroup() can be moved outside the synchronization of
groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several
hundred successful test runs with neither ConcurrentModificationException nor
any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain
unchanged? If we can change it, we might as well just use ConcurrentHashMap
instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not
really being familiar with any RMI concepts, I'm not sure I have it right.
I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used
similarly to groupTable. I haven't seen a test failure from it though. It
seems sensible to add a lock for it in Activation.writeObject() though. I
don't particularly like nesting the locks of idTable and groupTable, but
locking them separately would require serializing the Activation object field
by field instead of simply using defaultWriteObject(). Is this a reasonable
approach?

Thanks for any advice or comments.

s'marks

Re: advice & review requested for 6896297, race condition in rmid causing JCK failure

Reply via email to