[ 
https://issues.apache.org/jira/browse/TINKERPOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356410#comment-17356410
 ] 

Florian Hockmann commented on TINKERPOP-2573:
---------------------------------------------

Thanks for reporting this and giving your thoughts on how we could improve 
this, [~olivertowers]!
{quote}Gremlin server configurations can include a load-balancer in front of 
the server nodes (Azure Cosmos DB is one example).
{quote}
That is definitely something that we haven't accounted for enough. The driver 
was originally built with the idea that a {{ConnectionPool}} holds connections 
to a single server. TINKERPOP-1765 should add support for multiple servers and 
then each server could have its own {{ConnectionPool}}. That's also how the 
Java driver operates if I'm not mistaken. But that of course doesn't help if 
multiple servers are behind a load balancer which is a setup that is getting 
more and more common in cloud environments or with containers in general.
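
Just to sketch that idea (none of this exists in Gremlin.NET today; 
{{_poolPerHost}} and {{CreatePoolForHost}} are made up for illustration, and 
it assumes {{using System.Collections.Concurrent;}}):
{code:csharp}
// Hypothetical sketch for TINKERPOP-1765: one ConnectionPool per host,
// similar to how the Java driver handles multiple servers.
private readonly ConcurrentDictionary<Uri, ConnectionPool> _poolPerHost =
    new ConcurrentDictionary<Uri, ConnectionPool>();

private ConnectionPool GetPoolForHost(Uri host)
{
    // CreatePoolForHost is hypothetical; it would create and fill a pool
    // with connections to just that one host.
    return _poolPerHost.GetOrAdd(host, h => CreatePoolForHost(h));
}
{code}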

 
{quote}1. Modify {{FillPoolAsync}} to handle connection failures individually, 
doing 'best-effort' replacement of missing connections.
 2. If {{FillPoolAsync}} observes an exception, still throw but do not dispose 
of connection tasks that were successful (and check the state of those 
successful connections and dispose if they are in error?).
{quote}
Aren't these two points basically the same? We could solve this by iterating 
over all connection tasks in the {{catch}} block, adding all connections from 
successfully finished tasks (maybe after also checking their connection state) 
to the pool and then rethrowing the exception.
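
Roughly something like this (just a sketch; the real pool internals differ a 
bit, and names like {{_connections}} and {{IsOpen}} are assumptions here):
{code:csharp}
try
{
    await Task.WhenAll(connectionCreationTasks).ConfigureAwait(false);
    foreach (var task in connectionCreationTasks)
        _connections.Add(task.Result);
}
catch (Exception)
{
    // Best effort: keep every connection that was established successfully,
    // dispose the rest, and still surface the failure to the caller.
    foreach (var task in connectionCreationTasks)
    {
        if (task.Status != TaskStatus.RanToCompletion) continue;
        if (task.Result.IsOpen)
            _connections.Add(task.Result);
        else
            task.Result.Dispose();
    }
    throw;
}
{code}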

This would of course leave the pool in a state where it doesn't have its 
configured size, but the missing connections can still be added later, e.g., 
through the retry policy as you already mentioned.

 
{quote}3. In {{ConnectionPool}} constructor, apply reconnect retry policy to 
{{ReplaceDeadConnectionAsync()}} invocation
{quote}
I'm not sure about this one yet. If the server is already unavailable when the 
user creates a {{GremlinClient}}, then I think that it's best to fail fast to 
inform the user about the problem. I think it's more likely in this case that 
the user supplied invalid connection parameters or that the server is 
completely unreachable (e.g., due to a missing firewall exception) than a 
transient error where a retry policy could help.

I would also like to avoid a constructor call that can take a long time. 
Creating the connections in the constructor of course already takes some time 
and goes beyond what a constructor should usually do. Maybe we could move this 
logic into some {{ConnectAsync}} method in the future. That method could be a 
static factory method and we could then make the constructor private, but that 
would of course be a breaking change, so it would have to wait for 3.6.0.
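
Just to illustrate what I mean (purely hypothetical API; {{ConnectAsync}} and 
{{ConnectionPool.CreateAsync}} don't exist in the driver):
{code:csharp}
public class GremlinClient : IDisposable
{
    private readonly ConnectionPool _connectionPool;

    // The constructor only stores already-established state and stays cheap.
    private GremlinClient(ConnectionPool connectionPool)
    {
        _connectionPool = connectionPool;
    }

    // All the slow network I/O happens in the factory method instead.
    public static async Task<GremlinClient> ConnectAsync(GremlinServer gremlinServer)
    {
        // ConnectionPool.CreateAsync is hypothetical: it would create the pool
        // and establish all connections before returning.
        var pool = await ConnectionPool.CreateAsync(gremlinServer).ConfigureAwait(false);
        return new GremlinClient(pool);
    }

    public void Dispose() => _connectionPool.Dispose();
}
{code}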

These are the two reasons I see against a retry policy in the constructor, but 
I'm also not completely against this. Do you frequently see the situation where 
calling the constructor throws due to a transient error or maybe an error with 
just one of many servers behind a load balancer?

 
{quote}To allow the underlying errors to be observed by callers, a 
callback/event handler could be added that is invoked when 
{{ReplaceClosedConnectionsAsync}} throws. This would allow any 
exceptions to be observed and collected in some manner.
{quote}
That's true. My take so far was that we should add logging to the driver 
(TINKERPOP-2471) to make things like these transparent to the user. Do you 
think that callbacks / an event handler would be a better approach than 
logging? I think logging would be a lot easier to use as users would only have 
to provide a logger.
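
With logging, usage could look roughly like this (the {{loggerFactory}} 
parameter is hypothetical since TINKERPOP-2471 isn't implemented yet):
{code:csharp}
using Gremlin.Net.Driver;
using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());

// The loggerFactory parameter is hypothetical (TINKERPOP-2471); today's
// GremlinClient constructor does not accept one.
using var client = new GremlinClient(
    new GremlinServer("localhost", 8182),
    loggerFactory: loggerFactory);
// Failed background connection replacements could then surface as log
// warnings instead of being swallowed silently.
{code}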
  
{quote}Further refinement of this: Have a way of populating the inner 
exception of {{ServerUnavailableException}} with the last observed 
replacement exception.
{quote}
Interesting idea. We currently don't do that as the connection replacement is 
handled asynchronously in the background by a different task, but we could of 
course save the last thrown exception and use it as the inner exception.
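
A rough sketch of that (the field and the {{ServerUnavailableException}} 
constructor overload taking an inner exception are part of the proposal; 
neither exists today):
{code:csharp}
// Hypothetical additions to ConnectionPool: neither _lastReplacementException
// nor a ServerUnavailableException(Exception) overload exist yet.
private volatile Exception _lastReplacementException;

private async Task ReplaceClosedConnectionsAsync()
{
    try
    {
        await FillPoolAsync().ConfigureAwait(false);
    }
    catch (Exception e)
    {
        // Remember the most recent background failure for diagnostics.
        _lastReplacementException = e;
        throw;
    }
}

private void ThrowServerUnavailable() =>
    throw new ServerUnavailableException(_lastReplacementException);
{code}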

> Gremlin.NET: Retry on CreateNewConnectionAsync() failures and avoid disposing 
> successful connections on a failure
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2573
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2573
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: dotnet
>    Affects Versions: 3.4.11
>            Reporter: Oliver Towers
>            Priority: Minor
>
> [FillPoolAsync|https://github.com/apache/tinkerpop/blob/dc4a5d8795e836a35ce4b61247b9693fa0cea8cb/gremlin-dotnet/src/Gremlin.Net/Driver/ConnectionPool.cs#L119]
>  attempts to create connections to fill the pool up to the target 
> {{PoolSize}}, issuing {{CreateNewConnectionAsync}} to build out or recover 
> the delta.
> If one of these tasks faults (e.g., due to some socket error during the 
> websocket handshake), then all successful connections are disposed and an 
> exception is thrown, which can result in either:
>  * {{ConnectionPool}} constructor throws, so the only option is to retry 
> creating the client.
>  * A fire-and-forget task for {{ReplaceDeadConnectionAsync}} fails and the 
> connection pool cannot recover the pool size at that time. The 
> {{GetAvailableConnectionAsync}} retry policy can kick in here and the pool 
> may eventually recover.
> Gremlin server configurations can include a load-balancer in front of the 
> server nodes (Azure Cosmos DB is one example). So there are cases where 
> faults with new connections are transient, and successful/healthy 
> connections could be kept instead of being destroyed.
> Given this, the proposal is the following:
>  # Modify {{FillPoolAsync}} to handle connection failures individually, doing 
> 'best-effort' replacement of missing connections.
>  # If {{FillPoolAsync}} observes an exception, still throw but do not dispose 
> of connection tasks that were successful (and check the state of those 
> successful connections and dispose if they are in error?).
>  # In {{ConnectionPool}} constructor, apply reconnect retry policy to 
> {{ReplaceDeadConnectionAsync()}} invocation
>  ## Retry on any Exception (since there are various IO/Socket exceptions that 
> can be thrown here). This maintains the previous behavior where the pool must 
> be filled in full at construction, and throws the underlying exception to the 
> caller.
> *Additional note:*
> Post construction, if {{GetAvailableConnection}} triggers a connection 
> replacement fire-and-forget task where the task encounters network 
> IO/handshake error, then these exceptions cannot be observed directly by the 
> caller. Only a {{ServerUnavailableException}} is thrown if the retries are 
> exhausted.
> To allow the underlying errors to be observed by callers, a callback/event 
> handler could be added that is invoked when 
> {{ReplaceClosedConnectionsAsync}} throws. This would allow any 
> exceptions to be observed and collected in some manner.
> Further refinement of this: Have a way of populating the inner exception of 
> {{ServerUnavailableException}} with the last observed replacement exception.


