[ https://issues.apache.org/jira/browse/TINKERPOP-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647721#comment-17647721 ]
ASF GitHub Bot commented on TINKERPOP-2813:
-------------------------------------------
vkagamlyk commented on code in PR #1882:
URL: https://github.com/apache/tinkerpop/pull/1882#discussion_r1049039097
##########
gremlin-server/src/test/java/org/apache/tinkerpop/gremlin/driver/ClientConnectionIntegrateTest.java:
##########
@@ -110,4 +124,100 @@ public void shouldCloseConnectionDeadDueToUnRecoverableError() throws Exception
assertThat(recordingAppender.logContainsAny("^(?!.*(isDead=false)).*isDead=true.*destroyed successfully.$"), is(true));
}
+
+ /**
+ * Added for TINKERPOP-2813 - this scenario would have previously thrown tons of
+ * {@link NoHostAvailableException}.
+ */
+ @Test
+ public void shouldSucceedWithJitteryConnection() throws Exception {
+ final Cluster cluster = TestClientFactory.build().minConnectionPoolSize(1).maxConnectionPoolSize(4).
+ reconnectInterval(1000).
+ maxWaitForConnection(4000).validationRequest("g.inject()").create();
+ final Client.ClusteredClient client = cluster.connect();
+
+ client.init();
+
+ // every 10 connections let's have some problems
+ final JitteryConnectionFactory connectionFactory = new JitteryConnectionFactory(3);
+ client.hostConnectionPools.forEach((h, pool) -> pool.connectionFactory = connectionFactory);
+
+ // get an initial connection which marks the host as available
+ assertEquals(2, client.submit("1+1").all().join().get(0).getInt());
+
+ // network is gonna get fishy - ConnectionPool should try to grow during the workload below and when it
+ // does some connections will fail to create in the background which should log some errors but not tank
+ // the submit() as connections that are currently still working and active should be able to handle the load.
+ connectionFactory.jittery = true;
+
+ // load up a hella ton of requests
+ final int requests = 1000;
+ final CountDownLatch latch = new CountDownLatch(requests);
+ final AtomicBoolean hadFailOtherThanTimeout = new AtomicBoolean(false);
+
+ new Thread(() -> {
+ IntStream.range(0, requests).forEach(i -> {
+ try {
+ client.submitAsync("1 + " + i);
+ } catch (Exception ex) {
+ // we could catch a TimeoutException here in some cases if the jitters cause a borrow of a
+ // connection to take too long. submitAsync() will wrap in a RuntimeException. can't assert
+ // this condition inside this thread or the test locks up
+ hadFailOtherThanTimeout.compareAndSet(false, !(ex.getCause() instanceof TimeoutException));
+ } finally {
+ latch.countDown();
+ }
+ });
+ }, "worker-shouldSucceedWithJitteryConnection").start();
+
+ // wait long enough for the jitters to kick in at least a little
+ while (latch.getCount() > 500) {
+ TimeUnit.MILLISECONDS.sleep(50);
+ }
+
+ // wait for requests to complete
+ assertTrue(latch.await(30000, TimeUnit.MILLISECONDS));
+
+ // make sure we had some failures for sure coming out the factory
+ assertThat(connectionFactory.getNumberOfFailures(), is(greaterThan(0L)));
+
+ // if there was a exception in the worker thread, then it had better be a TimeoutException
+ assertThat(hadFailOtherThanTimeout.get(), is(false));
+
+ connectionFactory.jittery = false;
Review Comment:
`connectionFactory` is not used again in this scope, so this assignment can be removed:
```suggestion
```
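The quoted test records worker-thread failures in a flag and asserts on it only after the latch has been released, because asserting inside the worker would lock up the test. A minimal, self-contained sketch of that pattern follows. The class and method names (`WorkerFailureFlagSketch`, `onlyTimeoutsObserved`) are hypothetical and exist only for illustration; the sketch uses plain `java.util.concurrent` types rather than the driver's `Client`, and it sets the flag with `set(true)` as a simplified variant of the `compareAndSet` call in the diff.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

// Hypothetical illustration only; not part of the TinkerPop test suite.
final class WorkerFailureFlagSketch {

    /**
     * Runs the task `requests` times on a worker thread. Returns true only if
     * every iteration finished within maxWaitMillis and every exception thrown
     * by the task had a TimeoutException as its cause.
     */
    static boolean onlyTimeoutsObserved(final int requests, final IntConsumer task,
                                        final long maxWaitMillis) throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(requests);
        final AtomicBoolean hadFailOtherThanTimeout = new AtomicBoolean(false);

        final Thread worker = new Thread(() -> {
            for (int i = 0; i < requests; i++) {
                try {
                    task.accept(i);
                } catch (Exception ex) {
                    // record the problem instead of asserting here; an assertion
                    // failure in this thread would never reach the test thread
                    if (!(ex.getCause() instanceof TimeoutException)) {
                        hadFailOtherThanTimeout.set(true);
                    }
                } finally {
                    // always count down so the waiting thread cannot hang on a failure
                    latch.countDown();
                }
            }
        }, "worker-sketch");
        worker.setDaemon(true);
        worker.start();

        // the calling thread waits, then inspects the flag
        final boolean finished = latch.await(maxWaitMillis, TimeUnit.MILLISECONDS);
        return finished && !hadFailOtherThanTimeout.get();
    }
}
```

Counting down in `finally` is what keeps the waiting thread from blocking until the timeout when an iteration throws.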
> Improve driver usability for cases where NoHostAvailableException is
> currently thrown
> -------------------------------------------------------------------------------------
>
> Key: TINKERPOP-2813
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2813
> Project: TinkerPop
> Issue Type: Improvement
> Components: driver
> Affects Versions: 3.5.4
> Reporter: Stephen Mallette
> Assignee: Stephen Mallette
> Priority: Blocker
>
> A {{NoHostAvailableException}} occurs in two cases:
> 1. where the {{Client}} is initialized and a failure occurs on all {{Host}}
> instances configured
> 2. when the {{Client}} attempts to {{chooseConnection()}} to send a request
> and all {{Host}} instances configured are marked unavailable.
> In the first case, you can get a cause for the failure which is helpful, but
> the inadequacy is that you only get the failure of the first {{Host}} to
> cause a problem. The second case is a bit worse because there you get no
> cause in the exception and it's a "fast fail" in that as soon as the request
> is sent there is no pause to see if the {{Host}} comes back online. Moreover,
> a {{Host}} can be marked for failure for the infraction of just a single
> {{Connection}} that may have just encountered an intermittent network issue,
> thus quite quickly killing the entire {{ConnectionPool}} and turning 100s of
> requests per second into 100s of {{NoHostAvailableException}} per second.
> Note that you can also get an infraction for the pool just being overloaded
> with requests which may signal that either the pool or the server is not sized
> right for the current workload - in either case, the
> {{NoHostAvailableException}} is a bit of a harsh way to deal with that and in
> any event doesn't quite give the user clues as to how to deal with it.
> All in all, this situation makes {{NoHostAvailableException}} hard to debug.
> This ticket is meant to help smooth some of these problems. Initial thoughts
> for improvements include better logging, ensuring that
> {{NoHostAvailableException}} is not thrown without a cause, preferring more
> specific exceptions in the first place to {{NoHostAvailableException}},
> getting rid of "fast fails" in favor of longer pauses to see if a host can
> recover and taking a softer stance on when a {{Host}} is actually considered
> "unavailable".
> Expecting to implement this without breaking API changes, though exceptions
> may shift around a bit, but will try to keep those to a minimum.
>
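Two of the ideas in the description above are replacing the "fast fail" with a bounded wait and never raising an exception without a cause. The sketch below illustrates both with JDK types only. It is not the driver's implementation: `WaitForHostSketch`, `awaitAvailable`, and the `chooser` supplier are hypothetical names that stand in for whatever selects a usable host or connection.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the TinkerPop driver API.
final class WaitForHostSketch {

    /**
     * Polls the chooser until it yields a value or maxWait elapses. On timeout,
     * the last failure seen (if any) is attached as the cause, so the caller
     * gets a hint about what went wrong.
     */
    static <T> T awaitAvailable(final Supplier<Optional<T>> chooser, final Duration maxWait,
                                final Duration pollInterval)
            throws TimeoutException, InterruptedException {
        final long deadline = System.nanoTime() + maxWait.toNanos();
        Throwable lastFailure = null;

        while (true) {
            try {
                final Optional<T> candidate = chooser.get();
                if (candidate.isPresent()) {
                    return candidate.get();
                }
            } catch (RuntimeException ex) {
                // remember why the attempt failed, then keep waiting
                lastFailure = ex;
            }
            if (System.nanoTime() - deadline >= 0) {
                break;
            }
            Thread.sleep(pollInterval.toMillis());
        }

        final TimeoutException timeout = new TimeoutException(
                "nothing became available within " + maxWait.toMillis() + "ms");
        if (lastFailure != null) {
            timeout.initCause(lastFailure);
        }
        throw timeout;
    }
}
```

A caller might pass a supplier that returns `Optional.empty()` while every host is marked unavailable. The trade-off is latency: a request waits up to `maxWait` before failing instead of failing immediately.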