[ https://issues.apache.org/jira/browse/TINKERPOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356410#comment-17356410 ]
Florian Hockmann commented on TINKERPOP-2573:
---------------------------------------------

Thanks for reporting this and giving your thoughts on how we could improve this, [~olivertowers]!

{quote}Gremlin server configurations can include a load balancer in front of the server nodes (Azure Cosmos DB is one example).{quote}

That is definitely something that we haven't accounted for enough. The driver was originally built with the idea that a {{ConnectionPool}} holds connections to a single server. TINKERPOP-1765 should add support for multiple servers, and then each server could have its own {{ConnectionPool}}. That's also how the Java driver operates, if I'm not mistaken. But that of course doesn't help if multiple servers sit behind a load balancer, a setup that is becoming more and more common in cloud environments and with containers in general.

{quote}1. Modify {{FillPoolAsync}} to handle connection failures individually, doing 'best-effort' replacement of missing connections.
2. If {{FillPoolAsync}} observes an exception, still throw but do not dispose of connection tasks that were successful (and check the state of those successful connections and dispose if they are in error?).{quote}

Aren't these two points basically the same? We could solve this by iterating over all connection tasks in the {{catch}} block, adding all connections from successfully finished tasks (maybe after also checking their connection state) to the pool, and then throwing the exception. This would of course leave the pool below its configured size, but the missing connections can still be added later, e.g., through the retry policy, as you already mentioned.

{quote}3. In {{ConnectionPool}} constructor, apply reconnect retry policy to {{ReplaceDeadConnectionAsync()}} invocation{quote}

I'm not sure about this one yet.
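To make the catch-block idea above concrete, here is a rough sketch in Java (the same pattern would translate to the .NET driver's tasks; {{Connection}} and {{fillPool}} are illustrative names, not the driver's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of a best-effort FillPoolAsync: keep connections from
// successful attempts, remember the last failure, and rethrow it afterwards
// so the caller still sees the error. All names here are hypothetical.
public class BestEffortPoolFill {

    // Stand-in for the driver's connection type.
    public static class Connection {
        public final int id;
        public Connection(int id) { this.id = id; }
    }

    public static void fillPool(List<Connection> pool,
                                List<CompletableFuture<Connection>> attempts) throws Exception {
        Exception lastFailure = null;
        for (CompletableFuture<Connection> attempt : attempts) {
            try {
                pool.add(attempt.join());  // successful attempts go into the pool
            } catch (Exception e) {
                lastFailure = e;           // remember the failure; don't dispose the rest
            }
        }
        if (lastFailure != null) {
            throw lastFailure;             // pool is short, but partially filled
        }
    }
}
```

The pool ends up below its configured size when some attempts fail, matching the comment above: the missing connections can then be added later through the existing replacement path.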
If the server is already unavailable when the user creates a {{GremlinClient}}, then I think it's best to fail fast to inform the user about the problem. In that case it's more likely that the user supplied invalid connection parameters or that the server is completely unreachable (e.g., due to a missing firewall exception) than that a transient error occurred where a retry policy could help. I would also like to avoid that calling a constructor can take a long time. Creating the connections in the constructor already takes some time and is beyond what a constructor should usually do. Maybe we could move this logic into some {{ConnectAsync}} method in the future. That method could be a static factory method, and we could then make the constructor private, but that would of course be a breaking change, so it would have to wait for 3.6.0. These are the two reasons I see against a retry policy in the constructor, but I'm also not completely against it. Do you frequently see the situation where calling the constructor throws due to a transient error, or maybe an error with just one of many servers behind a load balancer?

{quote}To allow the underlying errors to be observed by callers, a callback/event handler could be added that is invoked when {{ReplaceClosedConnectionsAsync}} throws. This would allow any exceptions to be observed and collected in some manner.{quote}

That's true. My take so far was that we should add logging to the driver (TINKERPOP-2471) to make things like these transparent to the user. Do you think that callbacks / an event handler would be a better approach than logging? I think logging would be a lot easier to use, as users would only have to provide a logger.

{quote}Further refinement of this: Have a way of populating the {{ServerUnavailableException}} inner exception with the last observed replacement exception.{quote}

Interesting idea.
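For reference, the static-factory idea mentioned above could look roughly like this (a Java sketch of the pattern, not the actual driver API; all names are invented for illustration):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the suggested pattern: a private constructor plus a static
// asynchronous factory, so connection setup (and any retries) happens in
// an explicit connectAsync() call instead of inside the constructor.
public class PooledClient {
    private final int poolSize;
    private boolean connected;

    private PooledClient(int poolSize) {
        this.poolSize = poolSize;   // cheap: no I/O in the constructor
    }

    public static CompletableFuture<PooledClient> connectAsync(int poolSize) {
        PooledClient client = new PooledClient(poolSize);
        // Real code would open poolSize connections here and could apply a
        // retry policy; this sketch just marks the client as connected.
        return CompletableFuture.supplyAsync(() -> {
            client.connected = true;
            return client;
        });
    }

    public boolean isConnected() { return connected; }
    public int getPoolSize() { return poolSize; }
}
```

The benefit of this shape is exactly what's described above: the constructor stays fast and side-effect free, while slow or retried connection setup lives in a method the caller explicitly awaits.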
We currently don't do that, as the connection replacement is handled asynchronously in the background by a different task, but we could of course save the last thrown exception and use it as the inner exception.

> Gremlin.NET: Retry on CreateNewConnectionAsync() failures and avoid disposing successful connections on a failure
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TINKERPOP-2573
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2573
> Project: TinkerPop
> Issue Type: Improvement
> Components: dotnet
> Affects Versions: 3.4.11
> Reporter: Oliver Towers
> Priority: Minor
>
> [FillPoolAsync|https://github.com/apache/tinkerpop/blob/dc4a5d8795e836a35ce4b61247b9693fa0cea8cb/gremlin-dotnet/src/Gremlin.Net/Driver/ConnectionPool.cs#L119] attempts to create connections to fill the pool up to the target {{PoolSize}}, issuing {{CreateNewConnectionAsync}} to build out or recover the delta.
> If one of these tasks faults (e.g. due to some socket error during the websocket handshake), then all successful connections are disposed and an exception is thrown, which can result in either:
> * The {{ConnectionPool}} constructor throws, so the only option is to retry creating the client.
> * A fire-and-forget task for {{ReplaceDeadConnectionAsync}} fails and the connection pool cannot recover the pool size at that time. The {{GetAvailableConnectionAsync}} retry policy can kick in here and the pool may eventually recover.
> Gremlin server configurations can include a load balancer in front of the server nodes (Azure Cosmos DB is one example). So there are cases where faults with new connections are transient, but successful/healthy connection attempts can be maintained without destroying them.
> Given this, the proposal is the following:
> # Modify {{FillPoolAsync}} to handle connection failures individually, doing 'best-effort' replacement of missing connections.
> # If {{FillPoolAsync}} observes an exception, still throw but do not dispose of connection tasks that were successful (and check the state of those successful connections and dispose if they are in error?).
> # In the {{ConnectionPool}} constructor, apply the reconnect retry policy to the {{ReplaceDeadConnectionAsync()}} invocation
> ## Retry on any Exception (since there are various IO/socket exceptions that can be thrown here). This maintains the previous behavior where the pool must be filled in full at construction, and throws the underlying exception to the caller.
> *Additional note:*
> Post construction, if {{GetAvailableConnection}} triggers a connection-replacement fire-and-forget task where the task encounters a network IO/handshake error, then these exceptions cannot be observed directly by the caller. Only a {{ServerUnavailableException}} is thrown if the retries are exhausted.
> To allow the underlying errors to be observed by callers, a callback/event handler could be added that is invoked when {{ReplaceClosedConnectionsAsync}} throws. This would allow any exceptions to be observed and collected in some manner.
> Further refinement of this: Have a way of populating the {{ServerUnavailableException}} inner exception with the last observed replacement exception.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
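Proposal item 3 (retry on any exception) and the inner-exception refinement from the description above could be combined as in this rough Java sketch ({{ServerUnavailable}} and {{retry}} are hypothetical names standing in for the driver's {{ServerUnavailableException}} and retry policy):

```java
import java.util.concurrent.Callable;

// Sketch: retry a connection-replacement action on any Exception, remember
// the last failure, and surface it as the cause (inner exception) of the
// final error once retries are exhausted. Names are illustrative only.
public class ReplacementRetry {

    public static class ServerUnavailable extends RuntimeException {
        public ServerUnavailable(Throwable cause) {
            super("Could not replace dead connection", cause);
        }
    }

    public static <T> T retry(Callable<T> action, int maxAttempts) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {   // retry on any exception, per the proposal
                last = e;
            }
        }
        // Retries exhausted: expose the last observed replacement failure.
        throw new ServerUnavailable(last);
    }
}
```

With this shape, a caller that catches the final error can inspect {{getCause()}} to see the last underlying IO/handshake exception instead of only a generic "server unavailable" message.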