[ https://issues.apache.org/jira/browse/TINKERPOP-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647607#comment-17647607 ]
ASF GitHub Bot commented on TINKERPOP-2813:
-------------------------------------------

FlorianHockmann commented on code in PR #1882:
URL: https://github.com/apache/tinkerpop/pull/1882#discussion_r1048686076


##########
docs/src/reference/gremlin-variants.asciidoc:
##########
@@ -658,6 +658,31 @@ social.persons("marko").knows("josh");

 NOTE: Using Maven, as shown in the `gremlin-archetype-dsl` module, makes developing DSLs with the annotation processor straightforward in that it sets up appropriate paths to the generated code automatically.

+[[gremlin-java-troubleshooting]]
+=== Troubleshooting
+
+*Max frame length of 65536 has been exceeded*
+
+This error occurs when the driver attempts to process a request/response that exceeds the configured maximum size.
+The most direct way to fix this problem is to increase the `maxContentLength` setting in the driver. Ideally, the
+`maxContentLength` set for the driver should match the setting defined on the server.
+
+*TimeoutException*
+
+A `TimeoutException` is thrown by the driver when the time limit assigned by the `maxWaitForConnection` is exceeded
+when trying to borrow a connection from the connection pool for a particular host. There are generally two scenarios
+where this occurs:
+
+1. The server has actually reached its maximum capacity or the driver has just learned that the server is unreachable.
+2. The client is throttling requests.

Review Comment:
   To be honest, when I just read this once it was not directly clear to me what that means. Why is it throttling requests? Simply adding "[...] because the pool is exhausted" would already help.
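The two driver settings discussed in the quoted troubleshooting section could be configured together, roughly as follows. This is only a sketch against gremlin-driver's `Cluster` builder; the host and values are illustrative assumptions and should be tuned to match the server:

```java
// Sketch only: raising the driver-side limits discussed above.
// Values are examples, not recommendations.
final Cluster cluster = Cluster.build("localhost")
        .maxContentLength(10 * 1024 * 1024) // should match the server's maxContentLength (default 65536)
        .maxWaitForConnection(16000)        // ms to wait for a pool borrow before TimeoutException
        .create();
```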
##########
gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java:
##########
@@ -334,10 +362,13 @@ private boolean addConnectionIfUnderMaximum() {
         }

         try {
-            connections.add(new Connection(host.getHostUri(), this, settings().maxInProcessPerConnection));
-        } catch (ConnectionException ce) {
-            logger.error("Connections were under max, but there was an error creating the connection.", ce);
+            connections.add(connectionFactory.create(this));
+        } catch (Exception ce) {

Review Comment:
   Really just a tiny nit, but this should probably be `e` now that it's no longer a `ConnectionException`.

##########
docs/src/upgrade/release-3.5.x.asciidoc:
##########
@@ -50,6 +50,48 @@ var client = new GremlinClient(new GremlinServer("localhost", 8182), loggerFacto

 See: link:https://issues.apache.org/jira/browse/TINKERPOP-2471[TINKERPOP-2471]

+==== gremlin-driver Host Availability
+
+The Java driver's approach to determining the health of the host to which it is connected is important to how it behaves.
+The driver has generally taken a pessimistic approach to make this determination, where a failure to borrow a
+connection from the pool would constitute enough evidence to mark the host as dead. Unfortunately, a failure to borrow
+from the pool is not the best indicator of a dead host. Intermittent network failure, an overly busy client, or other
+temporary issues could have been at play while the server was perfectly available.
+
+The consequences of marking a host dead are fairly severe, as the connection pool is shut down. While it is shut down
+in an orderly fashion so as to allow existing requests to complete if at all possible, it immediately blocks new
+requests, producing an immediate `NoHostAvailableException` (assuming there are actually no other hosts to route
+requests to). There is then some delay to reconnect to the server and re-establish each of the connections in the pool,
+which might take some time especially if there was a large pool.
If an application were sending hundreds of request

Review Comment:
   ```suggestion
   which might take some time especially if there was a large pool. If an application were sending hundreds of requests
   ```

##########
docs/src/reference/gremlin-variants.asciidoc:
##########
@@ -658,6 +658,31 @@ social.persons("marko").knows("josh");

 NOTE: Using Maven, as shown in the `gremlin-archetype-dsl` module, makes developing DSLs with the annotation processor straightforward in that it sets up appropriate paths to the generated code automatically.

+[[gremlin-java-troubleshooting]]
+=== Troubleshooting
+
+*Max frame length of 65536 has been exceeded*
+
+This error occurs when the driver attempts to process a request/response that exceeds the configured maximum size.
+The most direct way to fix this problem is to increase the `maxContentLength` setting in the driver. Ideally, the
+`maxContentLength` set for the driver should match the setting defined on the server.
+
+*TimeoutException*
+
+A `TimeoutException` is thrown by the driver when the time limit assigned by the `maxWaitForConnection` is exceeded
+when trying to borrow a connection from the connection pool for a particular host. There are generally two scenarios
+where this occurs:
+
+1. The server has actually reached its maximum capacity or the driver has just learned that the server is unreachable.
+2. The client is throttling requests.
+
+The latter of the two can be addressed from the driver side in the following ways:

Review Comment:
   Not sure if that's necessary at all, but how about mentioning that the exception message / the error log message should contain information about the state of the `ConnectionPool` and its `Connections` to understand which of the two cases caused the problem?
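For the "client is throttling requests" case above, one driver-side mitigation is simply to give the pool more room to grow. A minimal sketch against gremlin-driver's `Cluster` builder; the sizes are illustrative assumptions, not tuned values:

```java
// Sketch only: enlarging the pool so borrows are less likely to queue up
// when the workload spikes. Sizes are illustrative.
final Cluster cluster = Cluster.build("localhost")
        .minConnectionPoolSize(4)
        .maxConnectionPoolSize(16)     // more connections before the pool is exhausted
        .maxInProcessPerConnection(8)  // more simultaneous requests per connection
        .create();
```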
##########
gremlin-server/src/test/java/org/apache/tinkerpop/gremlin/driver/ClientConnectionIntegrateTest.java:
##########
@@ -110,4 +124,100 @@ public void shouldCloseConnectionDeadDueToUnRecoverableError() throws Exception
         assertThat(recordingAppender.logContainsAny("^(?!.*(isDead=false)).*isDead=true.*destroyed successfully.$"), is(true));
     }
+
+    /**
+     * Added for TINKERPOP-2813 - this scenario would have previously thrown tons of
+     * {@link NoHostAvailableException}.
+     */
+    @Test
+    public void shouldSucceedWithJitteryConnection() throws Exception {
+        final Cluster cluster = TestClientFactory.build().minConnectionPoolSize(1).maxConnectionPoolSize(4).
+                reconnectInterval(1000).
+                maxWaitForConnection(4000).validationRequest("g.inject()").create();
+        final Client.ClusteredClient client = cluster.connect();
+
+        client.init();
+
+        // every 10 connections let's have some problems
+        final JitteryConnectionFactory connectionFactory = new JitteryConnectionFactory(3);
+        client.hostConnectionPools.forEach((h, pool) -> pool.connectionFactory = connectionFactory);
+
+        // get an initial connection which marks the host as available
+        assertEquals(2, client.submit("1+1").all().join().get(0).getInt());
+
+        // network is gonna get fishy - ConnectionPool should try to grow during the workload below and when it
+        // does some connections will fail to create in the background which should log some errors but not tank
+        // the submit() as connections that are currently still working and active should be able to handle the load.
+        connectionFactory.jittery = true;
+
+        // load up a hella ton of requests
+        final int requests = 1000;
+        final CountDownLatch latch = new CountDownLatch(requests);
+        final AtomicBoolean hadFailOtherThanTimeout = new AtomicBoolean(false);
+
+        new Thread(() -> {
+            IntStream.range(0, requests).forEach(i -> {
+                try {
+                    client.submitAsync("1 + " + i);
+                } catch (Exception ex) {
+                    // we could catch a TimeoutException here in some cases if the jitters cause a borrow of a
+                    // connection to take too long. submitAsync() will wrap in a RuntimeException. can't assert
+                    // this condition inside this thread or the test locks up
+                    hadFailOtherThanTimeout.compareAndSet(false, !(ex.getCause() instanceof TimeoutException));
+                } finally {
+                    latch.countDown();
+                }
+            });
+        }, "worker-shouldSucceedWithJitteryConnection").start();
+
+        // wait long enough for the jitters to kick in at least a little
+        while (latch.getCount() > 500) {

Review Comment:
   (nitpick) Is this `500` because then half the requests have been executed? In that case, I'd change this to `requests / 2` to make this clear.

> Improve driver usability for cases where NoHostAvailableException is currently thrown
> -------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2813
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2813
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: driver
>    Affects Versions: 3.5.4
>            Reporter: Stephen Mallette
>            Assignee: Stephen Mallette
>            Priority: Blocker
>
> A {{NoHostAvailableException}} occurs in two cases:
> 1. where the {{Client}} is initialized and a failure occurs on all {{Host}} instances configured
> 2. when the {{Client}} attempts to {{chooseConnection()}} to send a request and all {{Host}} instances configured are marked unavailable.
> In the first case, you can get a cause for the failure which is helpful, but the inadequacy is that you only get the failure of the first {{Host}} to cause a problem.
> The second case is a bit worse because there you get no cause in the exception and it's a "fast fail" in that as soon as the request is sent there is no pause to see if the {{Host}} comes back online. Moreover, a {{Host}} can be marked for failure for the infraction of just a single {{Connection}} that may have just encountered an intermittent network issue, thus quite quickly killing the entire {{ConnectionPool}} and turning 100s of requests per second into 100s of {{NoHostAvailableException}} per second.
> Note that you can also get an infraction for the pool just being overloaded with requests, which may signal that either the pool or the server is not sized right for the current workload - in either case, the {{NoHostAvailableException}} is a bit of a harsh way to deal with that and in any event doesn't quite give the user clues as to how to deal with it.
> All in all, this situation makes {{NoHostAvailableException}} hard to debug. This ticket is meant to help smooth some of these problems. Initial thoughts for improvements include better logging, ensuring that {{NoHostAvailableException}} is not thrown without a cause, preferring more specific exceptions to {{NoHostAvailableException}} in the first place, getting rid of "fast fails" in favor of longer pauses to see if a host can recover, and taking a softer stance on when a {{Host}} is actually considered "unavailable".
> Expecting to implement this without breaking API changes, though exceptions may shift around a bit; will try to keep those to a minimum.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
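The ticket's preference for longer pauses over "fast fails" can also be approximated on the client side today with a small retry helper. This is a hypothetical sketch: `RetryingSubmitter` and its parameters are not part of the gremlin-driver API, and it is shown only to illustrate the backoff idea.

```java
import java.util.concurrent.Callable;

// Hypothetical client-side mitigation: instead of surfacing a fast-failing
// NoHostAvailableException immediately, retry the request a few times with an
// increasing pause, giving a temporarily unreachable host time to recover.
public class RetryingSubmitter {
    public static <T> T submitWithBackoff(final Callable<T> request, final int maxAttempts,
                                          final long initialDelayMillis) throws Exception {
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception ex) {
                last = ex;                  // e.g. a NoHostAvailableException from the driver
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);    // pause rather than fail fast
                    delay *= 2;             // exponential backoff between attempts
                }
            }
        }
        throw last;                         // all attempts failed; rethrow the last cause
    }
}
```

A caller would wrap something like `client.submit("1+1").all()` in the `Callable`; tuning `maxAttempts` and the initial delay trades latency for resilience.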