[ https://issues.apache.org/jira/browse/TINKERPOP-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647607#comment-17647607 ]
ASF GitHub Bot commented on TINKERPOP-2813:
-------------------------------------------

FlorianHockmann commented on code in PR #1882:
URL: https://github.com/apache/tinkerpop/pull/1882#discussion_r1048686076


##########
docs/src/reference/gremlin-variants.asciidoc:
##########
@@ -658,6 +658,31 @@ social.persons("marko").knows("josh");

 NOTE: Using Maven, as shown in the `gremlin-archetype-dsl` module, makes developing DSLs with the annotation processor straightforward in that it sets up appropriate paths to the generated code automatically.

+[[gremlin-java-troubleshooting]]
+=== Troubleshooting
+
+*Max frame length of 65536 has been exceeded*
+
+This error occurs when the driver attempts to process a request/response that exceeds the configured maximum size.
+The most direct way to fix this problem is to increase the `maxContentLength` setting in the driver. Ideally, the
+`maxContentLength` set for the driver should match the setting defined on the server.
+
+*TimeoutException*
+
+A `TimeoutException` is thrown by the driver when the time limit assigned by the `maxWaitForConnection` is exceeded
+when trying to borrow a connection from the connection pool for a particular host. There are generally two scenarios
+where this occurs:
+
+1. The server has actually reached its maximum capacity or the driver has just learned that the server is unreachable.
+2. The client is throttling requests.

Review Comment:
   To be honest, when I just read this once it was not directly clear to me what that means. Why is it throttling requests? Simply adding "[...] because the pool is exhausted" would already help.
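The two driver settings discussed in the quoted troubleshooting section could be configured together, roughly as follows. This is only a sketch against gremlin-driver's `Cluster` builder; the host and values are illustrative assumptions and should be tuned to match the server:

```java
// Sketch only: raising the driver-side limits discussed above.
// Values are examples, not recommendations.
final Cluster cluster = Cluster.build("localhost")
        .maxContentLength(10 * 1024 * 1024) // should match the server's maxContentLength (default 65536)
        .maxWaitForConnection(16000)        // ms to wait for a pool borrow before TimeoutException
        .create();
```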
##########
gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java:
##########
@@ -334,10 +362,13 @@ private boolean addConnectionIfUnderMaximum() {
         }

         try {
-            connections.add(new Connection(host.getHostUri(), this, settings().maxInProcessPerConnection));
-        } catch (ConnectionException ce) {
-            logger.error("Connections were under max, but there was an error creating the connection.", ce);
+            connections.add(connectionFactory.create(this));
+        } catch (Exception ce) {

Review Comment:
   Really just a tiny nit, but this should probably be `e` now that it's no longer a `ConnectionException`.

##########
docs/src/upgrade/release-3.5.x.asciidoc:
##########
@@ -50,6 +50,48 @@ var client = new GremlinClient(new GremlinServer("localhost", 8182), loggerFacto

 See: link:https://issues.apache.org/jira/browse/TINKERPOP-2471[TINKERPOP-2471]

+==== gremlin-driver Host Availability
+
+The Java driver's approach to determining the health of the host to which it is connected is important to how it behaves.
+The driver has generally taken a pessimistic approach to make this determination, where a failure to borrow a
+connection from the pool would constitute enough evidence to mark the host as dead. Unfortunately, a failure to borrow
+from the pool is not the best indicator of a dead host. Intermittent network failure, an overly busy client, or other
+temporary issues could have been at play while the server was perfectly available.
+
+The consequences of marking a host dead are fairly severe, as the connection pool is shut down. While it is shut down
+in an orderly fashion so as to allow existing requests to complete if at all possible, it immediately blocks new
+requests, producing an immediate `NoHostAvailableException` (assuming there are actually no other hosts to route
+requests to). There is then some delay to reconnect to the server and re-establish each of the connections in the pool,
+which might take some time especially if there was a large pool.
If an application were sending hundreds of request

Review Comment:
   ```suggestion
   which might take some time especially if there was a large pool. If an application were sending hundreds of requests
   ```

##########
docs/src/reference/gremlin-variants.asciidoc:
##########
@@ -658,6 +658,31 @@ social.persons("marko").knows("josh");

 NOTE: Using Maven, as shown in the `gremlin-archetype-dsl` module, makes developing DSLs with the annotation processor straightforward in that it sets up appropriate paths to the generated code automatically.

+[[gremlin-java-troubleshooting]]
+=== Troubleshooting
+
+*Max frame length of 65536 has been exceeded*
+
+This error occurs when the driver attempts to process a request/response that exceeds the configured maximum size.
+The most direct way to fix this problem is to increase the `maxContentLength` setting in the driver. Ideally, the
+`maxContentLength` set for the driver should match the setting defined on the server.
+
+*TimeoutException*
+
+A `TimeoutException` is thrown by the driver when the time limit assigned by the `maxWaitForConnection` is exceeded
+when trying to borrow a connection from the connection pool for a particular host. There are generally two scenarios
+where this occurs:
+
+1. The server has actually reached its maximum capacity or the driver has just learned that the server is unreachable.
+2. The client is throttling requests.
+
+The latter of the two can be addressed from the driver side in the following ways:

Review Comment:
   Not sure if that's necessary at all, but how about mentioning that the exception message / the error log message should contain information about the state of the `ConnectionPool` and its `Connections` to understand which of the two cases caused the problem?
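For the "client is throttling requests" case above, one driver-side mitigation is simply to give the pool more room to grow. A minimal sketch against gremlin-driver's `Cluster` builder; the sizes are illustrative assumptions, not tuned values:

```java
// Sketch only: enlarging the pool so borrows are less likely to queue up
// when the workload spikes. Sizes are illustrative.
final Cluster cluster = Cluster.build("localhost")
        .minConnectionPoolSize(4)
        .maxConnectionPoolSize(16)     // more connections before the pool is exhausted
        .maxInProcessPerConnection(8)  // more simultaneous requests per connection
        .create();
```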
##########
gremlin-server/src/test/java/org/apache/tinkerpop/gremlin/driver/ClientConnectionIntegrateTest.java:
##########
@@ -110,4 +124,100 @@ public void shouldCloseConnectionDeadDueToUnRecoverableError() throws Exception
         assertThat(recordingAppender.logContainsAny("^(?!.*(isDead=false)).*isDead=true.*destroyed successfully.$"), is(true));
     }
+
+    /**
+     * Added for TINKERPOP-2813 - this scenario would have previously thrown tons of
+     * {@link NoHostAvailableException}.
+     */
+    @Test
+    public void shouldSucceedWithJitteryConnection() throws Exception {
+        final Cluster cluster = TestClientFactory.build().minConnectionPoolSize(1).maxConnectionPoolSize(4).
+                reconnectInterval(1000).
+                maxWaitForConnection(4000).validationRequest("g.inject()").create();
+        final Client.ClusteredClient client = cluster.connect();
+
+        client.init();
+
+        // every 10 connections let's have some problems
+        final JitteryConnectionFactory connectionFactory = new JitteryConnectionFactory(3);
+        client.hostConnectionPools.forEach((h, pool) -> pool.connectionFactory = connectionFactory);
+
+        // get an initial connection which marks the host as available
+        assertEquals(2, client.submit("1+1").all().join().get(0).getInt());
+
+        // network is gonna get fishy - ConnectionPool should try to grow during the workload below and when it
+        // does some connections will fail to create in the background which should log some errors but not tank
+        // the submit() as connections that are currently still working and active should be able to handle the load.
+        connectionFactory.jittery = true;
+
+        // load up a hella ton of requests
+        final int requests = 1000;
+        final CountDownLatch latch = new CountDownLatch(requests);
+        final AtomicBoolean hadFailOtherThanTimeout = new AtomicBoolean(false);
+
+        new Thread(() -> {
+            IntStream.range(0, requests).forEach(i -> {
+                try {
+                    client.submitAsync("1 + " + i);
+                } catch (Exception ex) {
+                    // we could catch a TimeoutException here in some cases if the jitters cause a borrow of a
+                    // connection to take too long. submitAsync() will wrap in a RuntimeException. can't assert
+                    // this condition inside this thread or the test locks up
+                    hadFailOtherThanTimeout.compareAndSet(false, !(ex.getCause() instanceof TimeoutException));
+                } finally {
+                    latch.countDown();
+                }
+            });
+        }, "worker-shouldSucceedWithJitteryConnection").start();
+
+        // wait long enough for the jitters to kick in at least a little
+        while (latch.getCount() > 500) {

Review Comment:
   (nitpick) Is this `500` because then half the requests have been executed? In that case, I'd change this to `requests / 2` to make this clear.

> Improve driver usability for cases where NoHostAvailableException is currently thrown
> -------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2813
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2813
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: driver
>    Affects Versions: 3.5.4
>            Reporter: Stephen Mallette
>            Assignee: Stephen Mallette
>            Priority: Blocker
>
> A {{NoHostAvailableException}} occurs in two cases:
> 1. where the {{Client}} is initialized and a failure occurs on all {{Host}} instances configured
> 2. when the {{Client}} attempts to {{chooseConnection()}} to send a request and all {{Host}} instances configured are marked unavailable.
> In the first case, you can get a cause for the failure which is helpful, but the inadequacy is that you only get the failure of the first {{Host}} to cause a problem.
> The second case is a bit worse because there you get no cause in the exception and it's a "fast fail" in that as soon as the request is sent there is no pause to see if the {{Host}} comes back online. Moreover, a {{Host}} can be marked for failure for the infraction of just a single {{Connection}} that may have just encountered an intermittent network issue, thus quite quickly killing the entire {{ConnectionPool}} and turning 100s of requests per second into 100s of {{NoHostAvailableException}} per second.
> Note that you can also get an infraction for the pool just being overloaded with requests, which may signal that either the pool or the server is not sized right for the current workload - in either case, the {{NoHostAvailableException}} is a bit of a harsh way to deal with that and in any event doesn't quite give the user clues as to how to deal with it.
> All in all, this situation makes {{NoHostAvailableException}} hard to debug. This ticket is meant to help smooth some of these problems. Initial thoughts for improvements include better logging, ensuring that {{NoHostAvailableException}} is not thrown without a cause, preferring more specific exceptions to {{NoHostAvailableException}} in the first place, getting rid of "fast fails" in favor of longer pauses to see if a host can recover, and taking a softer stance on when a {{Host}} is actually considered "unavailable".
> Expecting to implement this without breaking API changes, though exceptions may shift around a bit; will try to keep those to a minimum.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
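The ticket's preference for longer pauses over "fast fails" can also be approximated on the client side today with a small retry helper. This is a hypothetical sketch: `RetryingSubmitter` and its parameters are not part of the gremlin-driver API, and it is shown only to illustrate the backoff idea.

```java
import java.util.concurrent.Callable;

// Hypothetical client-side mitigation: instead of surfacing a fast-failing
// NoHostAvailableException immediately, retry the request a few times with an
// increasing pause, giving a temporarily unreachable host time to recover.
public class RetryingSubmitter {
    public static <T> T submitWithBackoff(final Callable<T> request, final int maxAttempts,
                                          final long initialDelayMillis) throws Exception {
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception ex) {
                last = ex;                  // e.g. a NoHostAvailableException from the driver
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);    // pause rather than fail fast
                    delay *= 2;             // exponential backoff between attempts
                }
            }
        }
        throw last;                         // all attempts failed; rethrow the last cause
    }
}
```

A caller would wrap something like `client.submit("1+1").all()` in the `Callable`; tuning `maxAttempts` and the initial delay trades latency for resilience.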