[
https://issues.apache.org/jira/browse/TINKERPOP-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420651#comment-17420651
]
ASF GitHub Bot commented on TINKERPOP-2569:
-------------------------------------------
divijvaidya commented on pull request #1478:
URL: https://github.com/apache/tinkerpop/pull/1478#issuecomment-927738373
Let me try to phrase the situation as I understand it, and tell me if I
missed anything.
As per the current code base, client initialization is defined as having
established successful connections with at least one host in the cluster.
Currently, when the client tries to send its first request, `init()` is
called. `init()` tries to initialize connections across all hosts registered in
the cluster, and if one of the hosts is down, the client only logs an error.
Note that if initialization of a host fails, the client will not throw an
error or provide any response indicating that initialization has failed. This
is intended behaviour, because as per the definition of "successful
initialization", a client is successful if at least one of the hosts has
connected successfully.
If no host is successful, client initialization still **does not fail with a
`NoHostAvailableException`**. This is because we only throw this exception when
no host is marked as "available" in the cluster, and it is not necessary for a
host to have active connections to be marked as "available". Hence, even though
connection initialization to the host has failed, the host is still marked as
"available". This leads to `Client.initialized` being set to true, which is
incorrect based on the definition of client initialization.
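The flawed check described above can be sketched with a minimal, self-contained model. The `Host` class, its fields, and the `initialize()` method here are illustrative stand-ins for the driver's internals, not its actual API:

```java
import java.util.ArrayList;
import java.util.List;

class InitModel {
    static class Host {
        final String address;
        boolean available = true;     // marked available up front, independent of connections
        boolean hasConnections = false;
        Host(String address) { this.address = address; }
    }

    // Models the current init(): connection failures are only logged, and the
    // "available" flag never depends on whether a connection actually exists.
    static boolean initialize(List<Host> hosts) {
        boolean anyAvailable = false;
        for (Host h : hosts) {
            h.hasConnections = false;           // simulate: server is down, pool creation fails
            if (h.available) anyAvailable = true; // still true, so no NoHostAvailableException
        }
        // The exception would only be thrown when anyAvailable == false, so
        // initialization "succeeds" here even though no connection exists.
        return anyAvailable;
    }

    public static void main(String[] args) {
        List<Host> hosts = new ArrayList<>();
        hosts.add(new Host("localhost:8182"));
        boolean initialized = initialize(hosts);
        System.out.println("initialized=" + initialized
                + " connections=" + hosts.get(0).hasConnections);
    }
}
```

Running this prints `initialized=true connections=false`, which mirrors the mismatch: the client considers itself initialized despite having zero live connections.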
The code change in this PR sets a flag `noLiveHostAvailable` if at least one
of the hosts fails during client initialization.
On the next request, it then tries to init() the client again for all the
hosts registered in the cluster.
Let's consider how this change will perform in different cases.
Case#1: One host registered to the cluster, and the host is down during
client init
In this case, init() will not throw a `NoHostAvailableException`.
`Client.initialized` will be set to true and `Client.noLiveHostAvailable` will
be set to true. When a second request is made to the client, it will try to
init() again, and if the host is alive at that time, it will succeed.
Note that the new flag is doing what the original `Client.initialized` flag
should have been doing here.
Case#2: Multiple hosts registered, and only one host is down during client
init
In this case, init() will not throw a `NoHostAvailableException`.
`Client.initialized` will be set to true and `Client.noLiveHostAvailable` will
be set to true. When a second request is made to the client, it will try to
init() again against all the hosts. This leads to the already-successful hosts
also creating a new connection pool (which is wrong and wasteful).
Due to the above problem in case#2, this PR is not a suitable change.
I would propose the following alternative:
A client should:
1. Mark a host as available in the cluster only after connections to the
host have been initialized.
2. During client initialization, mark hosts with successful connections as
available and hosts with unsuccessful connections as unavailable.
3. If no host is successful during client init, throw a
`NoHostAvailableException` to the user and let the user handle this scenario.
Different applications want different behaviour here: some might want to fail
their application startup, while others might want to keep retrying until
successful. We should give applications the flexibility to decide this for
themselves.
4. If some hosts are successful and some are not, mark the unavailable ones
via `considerHostUnavailable()`. This keeps retrying the failed hosts in the
background while the successful hosts serve existing connections.
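The four points above can be sketched as a self-contained model. `Host`, `reachable`, and `retryScheduled` are hypothetical stand-ins for the driver's host state and the background-retry effect of `considerHostUnavailable()`, not its real API:

```java
import java.util.ArrayList;
import java.util.List;

class ProposedInit {
    static class NoHostAvailableException extends RuntimeException {}

    static class Host {
        final String address;
        final boolean reachable;       // simulates whether the server is up
        boolean available = false;     // point 1: not available until connections exist
        boolean retryScheduled = false;
        Host(String address, boolean reachable) { this.address = address; this.reachable = reachable; }
    }

    static void initialize(List<Host> hosts) {
        int successes = 0;
        for (Host h : hosts) {
            if (h.reachable) {
                h.available = true;       // point 2: mark available only on success
                successes++;
            } else {
                h.retryScheduled = true;  // point 4: considerHostUnavailable() -> background retries
            }
        }
        if (successes == 0) {
            // point 3: surface the failure so the caller can fail fast or retry
            throw new NoHostAvailableException();
        }
    }

    public static void main(String[] args) {
        // All hosts down: the caller sees the exception and decides what to do.
        List<Host> allDown = new ArrayList<>();
        allDown.add(new Host("host-a:8182", false));
        try {
            initialize(allDown);
            System.out.println("unexpected success");
        } catch (NoHostAvailableException e) {
            System.out.println("no host available");
        }

        // Mixed: one host serves traffic while the other retries in the background.
        List<Host> mixed = new ArrayList<>();
        mixed.add(new Host("host-a:8182", true));
        mixed.add(new Host("host-b:8182", false));
        initialize(mixed);
        System.out.println("host-a available=" + mixed.get(0).available
                + " host-b retrying=" + mixed.get(1).retryScheduled);
    }
}
```

The first call prints `no host available`; the second prints `host-a available=true host-b retrying=true`, showing that partial failure no longer blocks the healthy hosts.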
I would suggest making the changes to follow the above points; that will also
make the original test pass (since, in the single-host case, init() will be
called on every request because client.initialized will be false).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> Reconnect to server if Java driver fails to initialize
> ------------------------------------------------------
>
> Key: TINKERPOP-2569
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2569
> Project: TinkerPop
> Issue Type: Bug
> Components: driver
> Affects Versions: 3.4.11
> Reporter: Stephen Mallette
> Priority: Minor
>
> As reported here on SO:
> https://stackoverflow.com/questions/67586427/how-to-recover-with-a-retry-from-gremlin-nohostavailableexception
> If the host is unavailable at {{Client}} initialization then the host is not
> put in a state where reconnect is possible. Essentially, this test for
> {{GremlinServerIntegrateTest}} should pass:
> {code}
> @Test
> public void shouldFailOnInitiallyDeadHost() throws Exception {
>     // start test with no server
>     this.stopServer();
>
>     final Cluster cluster = TestClientFactory.build().create();
>     final Client client = cluster.connect();
>
>     try {
>         // try to re-issue a request now that the server is down
>         client.submit("g").all().get(3000, TimeUnit.MILLISECONDS);
>         fail("Should throw an exception.");
>     } catch (RuntimeException re) {
>         // Client would have no active connections to the host, hence it would
>         // encounter a timeout trying to find an alive connection to the host.
>         assertThat(re.getCause(), instanceOf(NoHostAvailableException.class));
>
>         //
>         // should recover when the server comes back
>         //
>
>         // restart server
>         this.startServer();
>
>         // try a bunch of times to reconnect. on slower systems this may simply
>         // take longer...looking at you travis
>         for (int ix = 1; ix < 11; ix++) {
>             // the retry interval is 1 second, wait a bit longer
>             TimeUnit.SECONDS.sleep(5);
>
>             try {
>                 final List<Result> results = client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
>                 assertEquals(1, results.size());
>                 assertEquals(2, results.get(0).getInt());
>             } catch (Exception ex) {
>                 if (ix == 10)
>                     fail("Should have eventually succeeded");
>             }
>         }
>     } finally {
>         cluster.close();
>     }
> }
> {code}
> Note that there is a similar test that first allows a connect to a host and
> then kills it and then restarts it again called {{shouldFailOnDeadHost()}}
> which demonstrates that reconnection works in that situation.
> I thought it might be an easy fix to simply call
> {{considerHostUnavailable()}} in the {{ConnectionPool}} constructor in the
> event of a {{CompletionException}}, which should kickstart the reconnect
> process. The reconnects started firing but they all failed for some reason. I
> didn't have time to investigate further than that.
> Currently the only workaround is to recreate the {{Client}} if this sort of
> situation occurs.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)