On Jul 29, 2015, at 4:32 AM, David Holmes <david.hol...@oracle.com> wrote: > > On 29/07/2015 5:57 PM, Kim Barrett wrote: >> ... The race is being fixed by reordering a pair >> of volatile assignments. > > While this seems logical for the failure at hand it isn't immediately obvious > to me that setting next before setting the ENQUEUED state won't cause a > problem for some other piece of code. Really these things need to be tightly > synchronized - I think the real bug is the unsynchronized fast-path for the > empty-queue case in poll(). While that change was deliberate in 6666739 this > side-effect was not realized and would have affected the change. I hope > Martin and/or Tom see this and chime in.
I thought the poll() fast-path from 6666739 was ok at the time it was made, and that it was the later removal of synchronization on the reference by 8014890 that lead to the race under discussion, and reinstating that synchronization on the reference was another way to fix this race. Turns out I was wrong about that. The poll() fast-path introduced the possibility of the following unexpected behavior when another thread is in r.enqueue() and is in the problematic window: !r.enqueue() && q.poll() == null => true This can happen because the synchronization on the references is only in ReferenceQueue.enqueue(). If it was instead in Reference.enqueue(), or if ReferenceQueue.Null.enqueue() also synchronized on the reference, this race would be prevented. The removal of reference synchronization opened the door wider, also allowing this unexpected behavior: r.isEnqueued() && q.poll() == null => true The combination of those changes is needed for the ReferenceEnqueue regression test to fail, since it requires r.isEnqueued() && !r.enqueue() && q.poll() == null => true I wouldn't want to revert the poll() fast-path, since that was added for specific performance reasons. I don't think I'd want to add back synchronization on the reference, but might be persuaded otherwise. 8029205 should be looked at in that case. I've looked carefully at uses of r.next and I don't think there is a problem with reordering the r.next and r.queue assignments. Of course, the existing code was looked at by a number of people without spotting the race under discussion. A way to think about this that helped me (I think) understand the locking structure used here is that q.head and r.next are part of the queue associated with the reference. We can manipulate those using the queue's lock as the basis for protection, so long as the the reference's queue isn't changed. But when we change the reference's queue, the queue (including r.next) must be in a consistent state. [This suggests the r.queue = NULL assignment in reallyPoll() should be moved later, though I think the assignment order in reallyPoll() doesn't matter.] > That aside the commentary is rather verbose, a simple: > > // Only set ENQUEUED state after the reference is enqueued > > would suffice (and add "volatiles ensure ordering" if you want that to be > clearer). I do get a bit wordy sometimes. How about this: // Update r.queue *after* adding to queue's list, to avoid // race between enqueued checks and poll(). Volatiles // ensure ordering. >> CR: >> https://bugs.openjdk.java.net/browse/JDK-8132306 >> >> Webrev: >> http://cr.openjdk.java.net/~kbarrett/8132306/webrev.00/ >> >> Testing: >> jprt with default and hotspot testsets >> >> Locally tested race in original code by insertion of sleep in enqueue and >> verifying ReferenceEnqueue regression test fails consistently as >> described in the bug report. >> >> Locally tested fixed code by insertion of sleeps at various points and >> verifying ReferenceEnqueue regression test still passes.