On 7/14/16 12:46 PM, Ivan Gerasimov wrote:
Thank you David for looking into this!
Here's the webrev updated in accordance with your and Daniel's
suggestions:
http://cr.openjdk.java.net/~igerasim/8160892/01/webrev/
src/os/windows/vm/os_windows.cpp
L3900: bool registered = false;
Please consider adding a comment above this variable:
// We only attempt to register threads until a process exiting
// thread manages to set the process_exiting flag. Any threads
// that come through here after the process_exiting flag is set
// are unregistered and will be caught in the SuspendThread()
// infinite loop below.
My request is different than David's. I prefer the comment up
here by the 'registered' variable because that variable is
key for getting into the SuspendThread() infinite loop. The
comment also occurs before all the code paths that you have
trace out with the different Ept types and flags states.
I'm really hoping that this is the last tweak we need to make to
clearly comment what we're doing to alleviate this OS race.
The number of occurrences of the underlying bug keep going down
with every iteration so we're getting close.
Dan
Please see my answers inline
Nit: can we change 'registered_itself" to just "registered" please.
Done.
Can you explain under what conditions a thread will now reach the
self-suspension code. Is that only if an error occurred such that we
were unable to register our handle for the process-exiting thread to
wait on? If so some commentary on that block seems appropriate -
perhaps more appropriate there than back up at the place where it
failed to get the handle (as Dan requested).
There are three kinds of threads, which can be caught in that
self-suspension loop:
1) All threads that want to end (by calling _endthreadex()) *after*
some process-exiting thread raised the flag `process_exiting`.
The rationale here is that we know that the whole process is going to
be terminated quite soon, so we do not allow any thread to call
_endthreadex(), which seems to have the concurrency bug.
2) Any thread that wants to end the whole process, after some other
thread raised the flag `process_exiting`.
If more than one threads want to end the process, we let to do it only
the thread that could raise the flag `process_exiting`. All other
such threads will have to suspend themselves.
3) (Unlikely to happen in practice) Any thread that wants to end by
calling _endthreadex(), but which failed to register itself due to
failure of DuplicateHandle().
Here we still have a race, which can result in a wrong exit code of
the process.
Given we keep missing conditions I'm only cautiously optimistic about
this.
And I'd like to understand how we still sometimes end up exiting with
an "error code" that seems to be the value of an exception! :(
The last time the sentinel exit code =20115 was reported almost a year
ago.
After that the fix for JDK-8145127 had gone in, and I didn't see any
more reports about wrong exit codes since then.
In particular, that fix worked around the situation when more than one
threads concurrently call System.exit(), which might have caused a race.
With kind regards,
Ivan
Thanks,
David
With kind regards,
Ivan