Here's the updated webrev with the solution by Paul:
http://cr.openjdk.java.net/~igerasim/8024521/1/webrev/
I've also excluded the test from it.
Instead, I'm going to write a manual testing instruction for SQE.
I've tested the fix and it works as expected.
Sincerely yours,
Ivan
On 15.10.2013 10:29, Martin Buchholz wrote:
Hi
I'm the author of this code.
I'm perfectly willing to believe the close/drain code is buggy.
It is difficult to get right.
I'm looking forward to the updated webrev.
Often the best way to handle a stress test is to make it run quickly
by default, but provide a system property to make it run as long as
desired, for manual ad-hoc testing.
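For example, a minimal sketch of the "quick by default, long on demand" pattern Martin describes — the property name `test.duration.seconds` and the defaults are illustrative, not taken from any actual test:

```java
// Sketch: run briefly by default, but honor a system property so the same
// test can be run for as long as desired in manual ad-hoc stress testing.
// (Property name and defaults are illustrative.)
public class StressDuration {
    static long durationMillis() {
        // e.g. run with -Dtest.duration.seconds=600 for a 10-minute manual run
        String seconds = System.getProperty("test.duration.seconds", "1");
        return Long.parseLong(seconds) * 1000L;
    }

    public static void main(String[] args) throws InterruptedException {
        long deadline = System.currentTimeMillis() + durationMillis();
        long iterations = 0;
        while (System.currentTimeMillis() < deadline) {
            // ... one iteration of the stress scenario would go here ...
            iterations++;
            Thread.sleep(1);
        }
        System.out.println("iterations = " + iterations);
    }
}
```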
On Mon, Oct 14, 2013 at 9:04 AM, Ivan Gerasimov
<ivan.gerasi...@oracle.com> wrote:
Thanks Paul!
Yes, I like your solution better.
I'll test it and send the updated webrev shortly.
Sincerely yours,
Ivan
On 14.10.2013 15:33, Paul Sandoz wrote:
On Oct 14, 2013, at 11:36 AM, Ivan Gerasimov
<ivan.gerasi...@oracle.com> wrote:
Hi Paul,
Thank you for looking at the webrev!
On 14.10.2013 12:50, Paul Sandoz wrote:
Hi Ivan,
Why do you require closeLock? Is it not sufficient to
leave processExited as is and just add:
    public synchronized void close() throws IOException {
        super.close();
    }
I want to limit the impact on the code; that's why I
synchronize close() only with processExited().
If I made close() synchronized with all of the stream's
methods, then I think it could introduce some
unpredictable drawbacks.
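For context, a minimal sketch of what such a dedicated lock could look like — the class name and details here are illustrative, and the actual webrev may differ:

```java
// Illustrative sketch (not the actual webrev): close() and processExited()
// synchronize on a private closeLock, so they exclude each other without
// serializing against the BufferedInputStream methods that lock on 'this'.
class PipeStream extends java.io.BufferedInputStream {
    private final Object closeLock = new Object();

    PipeStream(java.io.InputStream in) { super(in); }

    @Override
    public void close() throws java.io.IOException {
        synchronized (closeLock) {
            super.close();
        }
    }

    void processExited() {
        synchronized (closeLock) {
            // the drain-and-swap of the underlying stream would go here,
            // now mutually exclusive with close() but not with read()
        }
    }
}
```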
Yeah, the close will block until the drain is complete, which
might also have some unpredictable drawback.
A concern is that there was already some code in place to check for
an async close which, if detected, throws away all the data and
sets the underlying filtered input stream to null. So
all that draining effort is wasted.
Another solution is to keep checking in the loop of
drainInputStream for an async close and fail-fast:
    private InputStream drainInputStream(InputStream in)
            throws IOException {
        int n = 0;
        int j;
        byte[] a = null;
        while ((j = in.available()) > 0) {
            if (buf == null) // asynchronous close()?
                return null; // stop reading and discard
            a = (a == null) ? new byte[j] : Arrays.copyOf(a, n + j);
            n += in.read(a, n, j);
        }
        if (buf == null) // asynchronous close()?
            return null; // discard
        else if (a == null)
            return ProcessBuilder.NullInputStream.INSTANCE;
        else
            return new ByteArrayInputStream(n == a.length ? a
                                                          : Arrays.copyOf(a, n));
    }
    /** Called by the process reaper thread when the process exits. */
    synchronized void processExited() {
        // Most BufferedInputStream methods are synchronized, but close()
        // is not, and so we have to handle concurrent racing close().
        try {
            InputStream in = this.in;
            if (in != null) {
                InputStream stragglers = drainInputStream(in);
                in.close();
                this.in = stragglers;
            }
        } catch (IOException ignored) {
            // probably an asynchronous close().
        }
    }
This still has the potential to look at the availability of
the current input before an async close, and to read from a
different input after an async close, but it will break out on
the next loop check if there are more bytes available to read;
otherwise it will be caught on the final check, so no
munged data will be exposed.
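The fail-fast pattern above can be sketched generically — here a volatile `closed` flag stands in for the `buf == null` check, and the names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Generic sketch of the fail-fast drain: re-check a "closed" indicator on
// every loop iteration and once more after the loop, so a racing close()
// at worst wastes one availability check and never exposes partial data.
class FailFastDrain {
    volatile boolean closed;

    byte[] drain(InputStream in) throws IOException {
        int n = 0;
        int j;
        byte[] a = null;
        while ((j = in.available()) > 0) {
            if (closed) return null;          // async close: stop and discard
            a = (a == null) ? new byte[j] : Arrays.copyOf(a, n + j);
            n += in.read(a, n, j);
        }
        if (closed) return null;              // final check: discard
        return (a == null) ? new byte[0]
                           : (n == a.length ? a : Arrays.copyOf(a, n));
    }
}
```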
Plus there was already some logic checking whether close
was asynchronously called (see BufferedInputStream.close):

372                 if (buf == null) // asynchronous close()?
373                     this.in = null;

and this would no longer be required.
Yes, you're right and these two lines can be removed.
I wouldn't remove the surrounding try {} catch
(IOException) though, as the stream can still be closed
asynchronously.
I am somewhat struggling to understand the test (I'm not
too familiar with this area). It requires a more expert
eye than mine.
IIUC you are trying to induce the output process
fd of ExecLoop to "flip" to another fd opened by OpenLoop.
Yes.
I actually adopted a testcase provided by the reporter of
the problem.
I only changed the test in such a way that it can be run
in an automated mode.
Below is the exact comment from the original testcase.
It may be a good idea to include it in the regtest.
/* The following code is a test case that demonstrates a close
 * race in ProcessPipeInputStream.
 *
 * A snippet of the offending code within ProcessPipeInputStream:
 *     // Called by the process reaper thread when the process exits.
 *     synchronized void processExited() {
 *         // Most BufferedInputStream methods are synchronized, but close()
 *         // is not, and so we have to handle concurrent racing close().
 *         try {
 *             InputStream in = this.in;
 *             if (in != null) {
 *                 byte[] stragglers = drainInputStream(in);
 *                 in.close();
 *
 * The comment even talks about the race. However, the code is not adequately
 * protecting against it. drainInputStream() (available() in particular) is not
 * threadsafe when a close happens in parallel. What's worse is that a subsequent
 * open() is likely to reassign the closed fd, so drainInputStream ends up draining
 * the wrong fd! This can cause OOMs, data corruption on other fds, and other very
 * bad things.
 *
 * Notes:
 * - Test creates a large sparse file in /tmp/bigsparsefile
 * - Test uses "/proc/meminfo" for simple input.
 * - Invoke using "-Xmx128m" to ensure the file size of 1G is greater than the heap size
 * - Test will run forever; if successful there will not be any OOM exceptions
 *   (on my test system they occur about 1 per minute)
 * - You need multiple cores (I use 2 x Xeon E5620 = 8 cores, 16 threads).
 *
 * The basic premise of the test is to create a 1GB sparse file. Then have lots
 * of threads opening and closing this file. In a loop within a separate thread, use UNIXProcess
 * to run a simple command. We read the output of the command, make sure the process
 * has exited, and then close the input stream. When the bug happens, processExited()
 * will use FIONREAD to figure out how much data to drain, BUT the fd it is checking
 * will be closed and reopened (now referring to /tmp/bigsparsefile), so it will
 * see the size as 1GB, then attempt to allocate a buffer of that size and
 * then OOM.
 */
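The open/close churn described above might be sketched like this — the class name and temp-file handling are illustrative, and the real test uses /tmp/bigsparsefile and pairs this churn with a separate exec loop:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch of one OpenLoop thread: repeatedly open and close the
// big file so that a just-closed pipe fd is likely to be reassigned to it,
// which is what makes drainInputStream read the wrong fd in the unpatched JDK.
class OpenLoop implements Runnable {
    final Path bigFile;
    final long deadline;
    volatile int cycles;

    OpenLoop(Path bigFile, long durationMillis) {
        this.bigFile = bigFile;
        this.deadline = System.currentTimeMillis() + durationMillis;
    }

    public void run() {
        while (System.currentTimeMillis() < deadline) {
            try (FileInputStream fis = new FileInputStream(bigFile.toFile())) {
                cycles++;                    // fd opened and promptly closed
            } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("bigsparsefile", null);
        OpenLoop loop = new OpenLoop(f, 200);
        Thread t = new Thread(loop);
        t.start();
        t.join();
        System.out.println("open/close cycles: " + loop.cycles);
        Files.delete(f);
    }
}
```

The real test would run many such threads concurrently with the exec loop, since the race only shows up when an fd closed by one thread is reopened by another in the drain window.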
The test concerns me since it is going to use up loads
of resources and run for 10 minutes before deciding
beyond reasonable doubt that there is no issue.
When I ran the test across different machines (with the
unpatched jdk) it reported from one to five races during
10 minutes.
We can surely decrease the timeout at the price of a higher
probability of a false positive result.
Unit-like tests should really run in under a minute, and
should ideally be frugal in their resource consumption;
otherwise they could starve other tests running in parallel. If
we cannot create a more deterministic test (I'm not sure it is
possible), then perhaps this test belongs to SQE and not in the
JDK tests?
Paul.