Hi Dmitry,

Strictly speaking, you are correct. As soon as you close a socket, there is a possibility -- perhaps vanishingly small but nonzero -- that you might not be able to open it again.

The first scenario, where the user of the socket itself opens the socket on an ephemeral port (e.g., new ServerSocket(0)), is of course preferred. This avoids race conditions entirely.
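For reference, a minimal sketch of that preferred pattern (Java, as in our tests):

    // Preferred: bind to an ephemeral port and keep using that same socket.
    ServerSocket server = new ServerSocket(0);  // the OS picks a free port
    int port = server.getLocalPort();           // report this to the client
    // ... accept connections on 'server' itself; never close and re-bind.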

It's the second case that I'm still wrestling with, and maybe Jaroslav is too. It's fairly difficult to get such "black box" systems to open an ephemeral port and report it back, as opposed to opening up their service on some port number handed in from the outside. (For RMI, rmid is the culprit here. I don't know about JMX.) What makes this hard is that the rmid service runs in a separate VM, so getting reliable information back from it is awkward.
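One possible shape for this, assuming we could get rmid (or a wrapper around it) to report its port -- the file name and the polling are purely illustrative, none of this exists today:

    // (imports: java.net.ServerSocket, java.nio.file.Files/Paths;
    //  exception handling omitted for brevity)

    // In the child VM: bind to an ephemeral port and report it.
    ServerSocket server = new ServerSocket(0);
    Files.write(Paths.get("rmid.port"),         // hypothetical file name
        Integer.toString(server.getLocalPort()).getBytes());

    // In the driver: wait for the file, then read the port from it.
    while (!Files.exists(Paths.get("rmid.port")))
        Thread.sleep(100);                      // crude poll, illustration only
    int port = Integer.parseInt(
        new String(Files.readAllBytes(Paths.get("rmid.port"))).trim());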

It's also fairly difficult to establish the retry logic in such cases. If the service fails with a BindException, maybe -- maybe -- it was because there was a conflict over the port, and a retry is warranted. But this needs to be distinguished from other failure modes, which should be reported as failures instead of triggering a retry. In principle this is possible to do; it just involves more restructuring of the tests, and possibly adding debug/test code to rmid. (It may yet come to that.)
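For what it's worth, the shape of the retry logic I have in mind is roughly this (randomPort and startServiceOn are hypothetical helpers):

    // Retry only on BindException; let every other failure propagate.
    final int MAX_ATTEMPTS = 5;        // arbitrary, for illustration
    for (int attempt = 1; ; attempt++) {
        int port = randomPort();       // hypothetical helper
        try {
            startServiceOn(port);      // hypothetical: launches rmid, etc.
            break;                     // success
        } catch (BindException e) {
            if (attempt == MAX_ATTEMPTS)
                throw e;               // give up; report as a failure
            // port conflict: loop and try a different port
        }
        // any other exception propagates and fails the test immediately
    }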

I'm still pondering why, in the open/close/reopen scenario, the reopen might fail. The obvious reason is that some other process on the system opened that port between the close and the reopen. I admit that this is a possibility. However, with the open/close/reopen scenario in place, we see tests that fail up to 15% of the time with BindExceptions. That is an extraordinarily high failure rate to be caused by some random other process happening to open the same port in the few microseconds between the close and the reopen. It's simply not believable to me.
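To be concrete, this is the failing pattern; the window between close() and the second bind is where the BindException shows up:

    // The problematic open/close/reopen pattern:
    ServerSocket probe = new ServerSocket(0);
    int port = probe.getLocalPort();
    probe.close();
    // ... window: is 'port' really free again yet? ...
    ServerSocket real = new ServerSocket(port);  // intermittent BindException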

My thinking is still that the port isn't ready for reuse until a small amount of time after it's closed. I have some test programs that exercise sockets in a particular way (e.g., from multiple threads, or opening and closing batches of sockets) that can reproduce the problem on some systems, and these test programs seem to behave better if a time delay is added between the close and the reopen. The exact circumstances under which the problem occurs are difficult to pin down and seem OS-specific, so choosing the "right" delay time is very difficult. But it does strengthen this conjecture in my mind.

Naturally it would be better if there were a way to determine when a port is available for reuse without actually opening it. I'm not aware of any such way, but I'm holding onto a little hope that one can be found.

s'marks



On 12/11/14 10:18 AM, Dmitry Samersoff wrote:
Stuart,

As soon as you close a socket, you open the door to a race.

So you need another communication channel to pass a port number (or bind
result) between a client and a server without closing the socket on the
server side.

The typical scenario used by network-related code is:

1. Server opens the socket
2. Server binds to port(0)
3. Server gets the port number assigned by the OS
4. Server informs the client (e.g., writes the port to a known file,
broadcasts it, etc.)
5. Client establishes the connection.

If the server is a black box and has to get a port number from outside,
the scenario looks like:

WHILE(!success and !timeout)
1. Driver chooses a random port number
2. Driver runs a server with this number
3. Driver checks that the server is actually listening on this port
    (e.g., by trying to connect to it itself)
WEND

4. Driver runs a client with this port number, or bails out with a
    descriptive error message.
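A sketch of step 3, the driver's liveness check (host and timeout values are illustrative):

    // Check that something is actually listening on the chosen port.
    // (imports: java.net.Socket, java.net.InetSocketAddress, java.io.IOException)
    try (Socket s = new Socket()) {
        s.connect(new InetSocketAddress("localhost", port), 1000);
        // connected: the server is up on this port
    } catch (IOException e) {
        // not listening (yet): retry with another port, or time out
    }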

-Dmitry

On 2014-12-11 20:53, Stuart Marks wrote:


On 12/11/14 7:09 AM, olivier.lagn...@oracle.com wrote:
On 11/12/2014 15:43, Dmitry Samersoff wrote:
You can set SO_LINGER to zero; in this case the socket will be closed
immediately, without waiting in TIME_WAIT
SO_LINGER did not help either in my case (see my previous mail to
Jaroslav).
That ended up using another hard-coded (supposedly free) port.
Note that this was before the RMI tests used randomly allocated ports.

But there is no reliable way to predict whether you can take this port
again after you close it.
This is what I observed in my case.

So the only valid solution is to try to connect to a random port and, if
this attempt fails, try another random port. Everything else will cause
more or less frequent intermittent failures.
IIRC this is what is currently done in the RMI tests.

The RMI tests are still suffering from this problem, unfortunately.

The RMI test library gets a "random" port with "new ServerSocket(0)",
gets the port number, closes the socket, then returns the port to the
caller. The caller then assumes that it can use that port as it wishes.
That's when the BindException can occur. There are about 10 RMI test
bugs in the database that all seem to have this as their root cause.

There is some retry logic in RMI's test library, but that's to avoid the
so-called "reserved ports" that specific RMI tests use, or if "new
ServerSocket(0)" fails. It doesn't have anything to do with the
BindException that occurs when the caller attempts to reuse the port
with another socket.

My observation is also that setting SO_REUSEADDR has no effect. I
haven't tried SO_LINGER. My hunch is that it won't have any effect,
since the sockets in question aren't actually going into TIME_WAIT
state. But I suppose it's worth a try.

I don't have any solution for this; we're still discussing the issue. I
think the best approach would be to refactor the code so that the
eventual user of the socket opens it up on an ephemeral port in the
first place. That avoids the open/close/reopen business. Unfortunately
that doesn't help the case where you want to tell another JVM to run a
service on a specific port. We don't have a solution for that case yet.

The second-best approach (not really a solution) is to open/close a
serversocket to get the port, sleep for a little bit, then return the
port number to the caller. This might give the kernel a chance to clean
up the socket after the close. Of course, this still has a race
condition, but it might reduce the incidence of problems to an
acceptable level.
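Roughly (the delay value is a guess, not something tuned):

    // Second-best: open/close to pick a port, pause, then hand it out.
    // (InterruptedException/IOException handling omitted for brevity.)
    ServerSocket probe = new ServerSocket(0);
    int port = probe.getLocalPort();
    probe.close();
    Thread.sleep(100);    // arbitrary delay; still racy, just less often
    return port;          // caller binds to 'port' and hopes for the best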

I'll let you know if we come up with anything better.

s'marks

