Re: jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silenty

shanliang Wed, 10 Sep 2014 12:46:54 -0700

shanliang wrote:

Daniel Fuchs wrote:

Hi,


Thanks for the detailed explanations.

The fact that the server doesn't store any client state
and can arbitrarily close the connection, leaving it to
the client to reestablish the connection, makes all this
code quite tricky and hard to follow.

Yes it is complicated, we allowed a server to close a client after aspecific timeout because in some case a client was dead but the serverneeded long long time to be informed, that could make memory problem.


I believe what you propose - making sure that the
notification thread doesn't stop without closing the
connection, is the right thing to do.

I wonder however if the code that closes the connection
should better be moved to ClientNotifForwarder::fetchNotifs?

ClientNotifForwarder::fetchNotifs has the following statement:

601         if (!shouldStop()) {
602             logger.error("NotifFetcher-run",
603                          "Failed to fetch notification, " +
604                          "stopping thread. Error is: " + ioe, ioe);
605             logger.debug("NotifFetcher-run",ioe);
607         }

Then it proceeds to return null, which causes the thread to die.

ClientNotifForwarder is an abstract super class and does not know howto close a connection, this class is extended byRMIConnector.RMINotifClient and JMXMP, if we modify the class to haveconnection reference, that might make problem for JMXMP.


It looks as if that's the place where the connection should ideally
be closed if it is not already closed, because it would ensure
that the thread never dies silently.

Otherwise I'd suggest improving the comment below:

1369                             // JDK-8049303
1370                             // possible again transient or a

1371 // non-deserialization exception,not know how

1372                             // to treat, close the connection

May I suggest something like:

// JDK-8049303
// We received an IOException - but the communicatorAdmin
// did not close the connection - possibly because the
// the original exception was raised by a transient network
// problem?
// We already know that this exception is not due to a deserialization
// issue as we already took care of that before involving the
// communicatorAdmin. Moreover - we already made one retry attempt
// at fetching the next batch of notifications - and the
// problem persisted.
// Since trying again doesn't seem to solve the issue, we will now
// close the connection. Doing otherwise might cause the
// NotifFetcher thread to die silently.

Yes more clear, here is the new webrev:

http://cr.openjdk.java.net/~sjiang/JDK-8049303/01/

Oh, not one retry attempt fetching the next batch of notifications, butthe *SAME* batch of notifications.

http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/<http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/02/>


Shanliang


Thanks Daniel!
Shanliang


best regards,

-- daniel

On 9/10/14 5:42 PM, shanliang wrote:

Hi,

The issue could happen like this:
1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
2) communicatorAdmin.gotIOException(ioe) was called to check the
connection, it did not close the connection because the connection was
now OK.
3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
exception and found it was not a dersialization exception, it re-threw
the original IOException
4) the caller ClientNotifForwarder did not know how to treat this
exception, decided to end silently.

The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:

if the fetchNotifs request gets an IOException, we examine the chain of

exceptions to determine whether this is a deserialization issue. Ifso -

we propagate the appropriate exception to the caller, who will then
proceed with fetching notifications one by one, otherwise we call
communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
    1) the call returns OK, means the connection is re-established, we
re-call the fetchNotifs;
    2) the call throws IOException, we check the connection status:
        2-1) "terminated", that means the connection is closed, we
re-throw the original IOException, the caller will end silently.
        2-2) not "terminated", we add a flag "retried" for this
situation, if the flag is false, we set the flag to true and re-do the
fetchNotifs request, this is useful for a transient network problem,

otherwise we close the connection and re-throw the originalIOException,

it is here we fix the bug.

We do not modify communicatorAdmin.gotIOException(ioe), it is calledtoo

by all other remote requests.

It is not easy to have a test reproducing the bug.

Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
<http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/00/>

Thanks,
Shanliang

Re: jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silenty

Reply via email to