Hi all,
Please review an updated webrev for this bug:
http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.1/
Using ConcurrentHashMap is much nicer in many ways, and it seems to resolve the
JCK failures at least as well as the previous fix did. This does nothing to
remedy the issue of snapshot consistency, which seems to be covered by 4945726,
which we'll postpone (again) to a future release.
s'marks
On 4/14/11 4:01 PM, Stuart Marks wrote:
Hi Peter,
Thanks for your comments on this issue. I understand from Alan that you have
some expertise in RMI; it's most welcome here.
It's good to hear that serialization compatibility isn't that big of a deal.
This makes things easier. It would seem reasonable to investigate replacing
HashMap with ConcurrentHashMap, which would remove a fair bit of the external
locking, and reduce or eliminate the need for a complex locking hierarchy along
with a big comment that describes it. This will also make it easier to retrofit
a coarser locking strategy if necessary; see below.
Converting instances of HM to CHM if necessary upon deserialization is probably
a good idea and isn't too difficult.
The larger issue of snapshot consistency is indeed troubling. I'd agree that
the ConcurrentModificationException is a symptom of a larger problem and that
making it go away (either with a different locking strategy or using CHM) is
mostly merely suppressing the symptom without addressing the larger issue. The
immediate need, though, is to fix the JCK failure, so I suspect what I'll need
to do is to push a fix for the CME and handle the resolution of any larger
issue separately.
I'll investigate an alternative fix that uses CHM instead of modified external
locking, and I'll post an updated webrev.
s'marks
On 4/13/11 11:14 PM, Peter Jones wrote:
Stuart,
A couple of comments so far:
I don't think that there would be a serialization compatibility problem with
changing Activation.groupTable to have a ConcurrentHashMap instead of a
HashMap. The field's declared type is Map, so the deserialization process
would be able to assign either class to the field, and the code doesn't
otherwise care about the Map implementation class. An rmid's state created
with CHM could not be recoverable with JDK 1.4, of course, but I don't think
that that's a concern. The fix would not help an rmid whose state gets
recovered from an earlier version created with HM, unless a replacement
(HM->CHM) was also done upon deserialization.
More troubling I think is the higher-level issue suggested by this
ConcurrentModificationException. Thread B (in the 6896297 Evaluation) is
attempting to write a "snapshot" of the entire persistent state of the
Activation instance, which presumably should be consistent with future
"updates" written to the log, while Thread A is concurrently attempting to
mutate that state (before writing an update for its mutation). It seems
highly doubtful that potentially including an uncertain amount of this
concurrent mutation in the written snapshot is safe. I might expect a coarser
lock to be used across all such mutations, together with their associated log
writes (at which point groupTable wouldn't need to be concurrent).
The approach at your webrev below seems OK to address the immediate
ConcurrentModificationException, but (unless I'm missing something) the
higher-level issue with snapshot correctness described above would remain.
Cheers,
-- Peter
P.S. This seems somewhat related to
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4945726
On Apr 7, 2011, at 6:19 PM, Stuart Marks wrote:
Hi Core-Libs developers,
I'd like to solicit some advice and discussion about this bug and a
potential fix I'm cooking for it. Here is the bug report; it contains
details about the problem and my analysis of it:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297
and here's a webrev of the fix I'm working on:
http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/
Briefly, the problem is incorrect synchronization of groupTable, a HashMap
field of an Activation object. The code mostly locks groupTable around any
access to it. However, no such locking is done during serialization. If the
groupTable is modified while it's being serialized,
ConcurrentModificationException occurs.
The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to
remove the external locking entirely. Unfortunately this will change the
serialized representation of the Activation object, which I'm not sure is
acceptable.
Assuming that we can't change the serialized represenation, the alternative
approach would be to make sure that locking is done properly during
serialization. This is fairly easy to do by locking groupTable in a
writeObject() method. Unfortunately, this introduces a deadlock.
This deadlock occurs because, with this modification, there are now paths
through the code that take locks in the opposite order. Specifically, the
addLogRecord() method locks the log object and then (via serialization and
the newly added writeObject() method) locks groupTable. However, the
unregisterGroup() method locks groupTable and calls
GroupEntry.unregisterGroup() which logs the event, which takes a lock on the
log.
After some analysis, I've determined that the call to
GroupEntry.unregisterGroup() can be moved outside the synchronization of
groupTable. This removes the ordering problem.
With these fixes in place (the state of the webrev above) I can get several
hundred successful test runs with neither ConcurrentModificationException
nor any deadlocks.
Of course, that doesn't mean the change is correct. :-)
Questions:
1. Is there a requirement that the serialized form of Activation remain
unchanged? If we can change it, we might as well just use ConcurrentHashMap
instead of HashMap.
2. Is my lock ordering analysis correct? I've pored over the code, but not
really being familiar with any RMI concepts, I'm not sure I have it right.
I've written this analysis into a big comment I've added to the code.
3. There is also a potential concurrency issue with idTable, which is used
similarly to groupTable. I haven't seen a test failure from it though. It
seems sensible to add a lock for it in Activation.writeObject() though. I
don't particularly like nesting the locks of idTable and groupTable, but
locking them separately would require serializing the Activation object
field by field instead of simply using defaultWriteObject(). Is this a
reasonable approach?
Thanks for any advice or comments.
s'marks