Hi David,
On 3/18/20 8:10 PM, David Holmes wrote:
Hi Patricio,
On 19/03/2020 6:44 am, Patricio Chilano wrote:
Hi David,
On 3/18/20 4:27 AM, David Holmes wrote:
Hi Patricio,
On 18/03/2020 6:14 am, Patricio Chilano wrote:
Hi all,
Please review the following patch:
Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
Calling closeConnection() on an already created/opened connection
includes calls to CloseHandle() on objects that can still be used
by other threads. This can lead to either undefined behavior or, as
detailed in the bug comments, changes of state of unrelated objects.
This was a really great find!
Thanks! : )
This issue was found while debugging some jshell test failures seen
after pushing 8230594. Less important, but there are also calls to
closeStream() from createStream()/openStream() that, when failing to
create/open a stream, return after executing
"CHECK_ERROR(enterMutex(stream, NULL));" without closing the intended
resources. Calling closeConnection() could then assert if the reason
for the previous failure was that the stream's mutex failed to be
created/opened. This patch aims to address these issues too.
Patch looks good in general. The internal reference count guards
deletion of the internal resources, and is itself safe because we
never actually delete the connection. Thanks for adding the comment
about this aspect.
A few items:
Please update copyright year before pushing.
Done.
Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way
as STREAM_INVARIANT.
Done.
170 unsigned int refcount;
171 jint state;
I'm unclear about the use of stream->state and connection->state as
guards - unless accessed under a mutex these would seem to at least
need acquire/release semantics.
Additionally, the reads of refcount would also seem to need some form
of memory synchronization - though the Windows docs for the
Interlocked* API do not show how to simply read such a variable!
Though I note that the RtlFirstEntrySList method for the
"Interlocked Singly Linked Lists" API does state "Access to the list
is synchronized on a multiprocessor system." which suggests a read
of such a variable does require some form of memory synchronization!
In the case of the stream struct, the state field is protected by the
mutex field. It is set to STATE_CLOSED while holding the mutex, and
threads that read it must acquire the mutex first through
sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't
acquire the mutex, we will return something other than SYS_OK and
the call will exit anyway. All of this behaves as before; I didn't
change it.
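Roughly, the existing stream-side pattern looks like this (an
illustrative sketch using the names quoted above; leaveMutex is the
assumed counterpart of enterMutex, not necessarily the exact code):

    CHECK_ERROR(enterMutex(stream, NULL));  /* returns !SYS_OK if the mutex is unusable */
    if (stream->state == STATE_CLOSED) {    /* state is only read while holding the mutex */
        (void)leaveMutex(stream);
        return SYS_ERR;
    }
    /* ... use the stream, then leaveMutex(stream) ... */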
Thanks for clarifying.
The refcount and state that I added to the SharedMemoryConnection
struct work together. For a thread closing the connection, setting
the connection state to STATE_CLOSED has to happen before reading the
refcount (more on the atomicity of that read later). That's why I
added the MemoryBarrier() call, which I now see is better placed
right after setting the connection state to closed. For the threads
accessing the connection, incrementing the refcount has to happen
before reading the connection state. That's already provided by
InterlockedIncrement(), which uses a full memory barrier. In this
way, if the thread closing the connection reads a refcount of 0, then
we know it's safe to release the resources, since other threads
accessing the connection will see that the state is closed after
incrementing the refcount. If the read of refcount is not 0, then a
thread may or may not be accessing the connection (it could have read
a connection state of STATE_CLOSED after incrementing the refcount);
we don't know, so we can't release anything. Similarly, if the thread
accessing the connection reads that the state is not closed, then we
know it's safe to access the stream, since anybody closing the
connection will still have to read the refcount, which will be at
least 1.
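In code form, the protocol described above is roughly (a minimal
sketch with illustrative field names; the actual macros are
ENTER_CONNECTION/LEAVE_CONNECTION in the webrev):

    /* Accessor threads (what ENTER_CONNECTION/LEAVE_CONNECTION roughly do): */
    InterlockedIncrement(&connection->refcount);    /* full memory barrier */
    if (connection->state == STATE_CLOSED) {
        InterlockedDecrement(&connection->refcount);
        return SYS_ERR;                             /* connection already closed */
    }
    /* ... use the connection ... */
    InterlockedDecrement(&connection->refcount);

    /* Closing thread: */
    connection->state = STATE_CLOSED;
    MemoryBarrier();               /* order the state store before the refcount read */
    if (connection->refcount == 0) {
        /* no accessor can still be inside: safe to CloseHandle() the resources */
    }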
As for the atomicity of the read of refcount, from
https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access,
it states that "simple reads and writes to properly-aligned 32-bit
variables are atomic operations". Maybe I should declare refcount
explicitly as DWORD32?
It isn't the atomicity of the naked read that's in question but the
visibility. Any latency in the visibility of the store done by the
Interlocked*() function should be handled by the retry loop, but what
is to stop the C++ compiler from hoisting the read of refcount out of
the loop? It isn't even volatile (which has a stronger meaning in VS
than in regular C++).
I see what you mean now; I was thinking about atomicity and order of
operations but didn't consider the visibility of that read. Yes, if the
compiler decides to be smart and hoist the read out of the loop, we
might never notice that it is safe to release those resources, and we
would leak them for no reason. I see from the Windows docs
(https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers)
that declaring it volatile, as you pointed out, should be enough to
prevent that.
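That is, something like the following declaration (assuming the field
stays a plain struct member as in v1):

    volatile DWORD32 refcount;  /* 'volatile' forces a fresh read on every loop iteration */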
Instead of having a refcount, we could have done something similar to
the stream struct and protected access to the connection through a
mutex. To avoid serializing all threads we could have used SRW locks,
where only the one closing the connection would do
AcquireSRWLockExclusive(). It would change the state of the
connection to STATE_CLOSED, close all handles, and then release the
lock. ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
release the lock in shared mode, as sketched below. But other than
maybe being easier to read, I don't think the change would be any
smaller.
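Here is roughly what that alternative would look like (a sketch only;
it is not what the patch does):

    SRWLOCK lock;  /* InitializeSRWLock(&connection->lock) when the connection is created */

    /* ENTER_CONNECTION / LEAVE_CONNECTION equivalents: */
    AcquireSRWLockShared(&connection->lock);
    if (connection->state == STATE_CLOSED) {
        ReleaseSRWLockShared(&connection->lock);
        return SYS_ERR;
    }
    /* ... use the connection ... */
    ReleaseSRWLockShared(&connection->lock);

    /* closeConnection: */
    AcquireSRWLockExclusive(&connection->lock);
    connection->state = STATE_CLOSED;
    /* CloseHandle() the events and mutex here */
    ReleaseSRWLockExclusive(&connection->lock);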
413 while (attempts>0) {
spaces around >
Done.
If the loop at 413 never encounters a zero refcount then it doesn't
close the events or the mutex but still returns SYS_OK. That seems
wrong, but I'm not sure what the right behaviour is here.
I can change the return value to be SYS_ERR, but I don't think there
is much we can do about it unless we want to wait forever until we
can release those resources.
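Something like this (the attempt count and delay are illustrative;
see the webrev for the actual values):

    int attempts = 10;                 /* illustrative bound */
    while (attempts > 0) {
        if (connection->refcount == 0) {
            /* safe: CloseHandle() the events and mutex */
            return SYS_OK;
        }
        Sleep(20);                     /* give accessors a chance to leave and decrement */
        attempts--;
    }
    return SYS_ERR;                    /* timed out: leak the handles rather than race */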
SYS_ERR would look better, but I see now that the return value is
completely ignored anyway. So we're just going to leak resources if
the loop "times out". I guess this is the best we can do.
Here is v2 with the corrections:
Full: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/
Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
(not sure why the indent fixes are not highlighted as changes, but the
Frames view does show they changed)
I'll give it a run on mach5 adding tier5 as Serguei suggested.
Thanks,
Patricio
Thanks,
David
And please wait for serviceability folk to review this.
Sounds good.
Thanks for looking at this David! I will move the MemoryBarrier() and
change the refcount to be DWORD32 if you are okay with that.
Thanks,
Patricio
Thanks,
David
-----
Tested in mach5 with the current baseline: tiers 1-3 and several
runs of open/test/langtools/:tier1, which includes the jshell tests
where this connector is used. I also applied patch
http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev,
mentioned in the comments of the bug, on top of the baseline and
ran the langtools tests with and without this fix. Without the fix,
around 30 repetitions already show failures in tests
jdk/jshell/FailOverExecutionControlTest.java and
jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With the
fix, I ran several hundred repetitions and saw no failures. Let me
know if there is any additional testing I should do.
As a side note, I see there are a couple of open issues related
to jshell failures (8209848) that could be related to this bug
and therefore might be fixed by this patch.
Thanks,
Patricio