[ https://issues.apache.org/jira/browse/TINKERPOP-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922392#comment-16922392 ]
Florian Hockmann commented on TINKERPOP-2288:
---------------------------------------------

This seems to be a problem that many users of Cosmos DB have with Gremlin.Net and that we thought we had already fixed (see TINKERPOP-2090).

[~pathuot]:
{quote}As a work around, Is there a way we can access this information from the code so that I can catch those scenario and create logic that re-initiate the connection pool?{quote}
You can't directly access this information. What you can do as a workaround right now is catch the exception and then either retry the request or, if it fails too often, dispose the {{GremlinClient}} and create a new one, which will also create new connections. (A rough sketch of this is at the end of this comment.)

[~samimajed]:
{quote}Yes, it does seem like gremlinConnection.Client.NrConnections is the only information that is accessible to log to a consumer of the library. Is that correct?{quote}
That's correct.
{quote}It would be great to have the following data available:
* Current open web socket connections in pool (or remaining available)
* Current number of in-flight requests in whichever web socket that's being used for my query{quote}
Since we currently don't use these values for anything in the pool, we simply don't have them, so we would need to compute them on the fly by iterating over all connections. I'm not sure whether it's really a good idea to provide this information if we don't already have it available, as users are probably not aware of the cost of computing it. It would also only enable workarounds for a problem that we should ultimately solve in the driver itself, right?

Another step to make the state of the connection pool more visible to users would be to add logging to the driver. We could then log, for example, each time we detect a dead connection or when a connection reaches its max-in-process limit.

[~jmondal]: Thanks for taking the time to investigate this!
{quote}Users need to catch these and move forward to request new connections and wait for the connection pool to be ultimately populated.{quote}
Yes, that's the workaround I would recommend right now for users of Cosmos DB.
{quote}Perhaps Gremlin .NET needs to find a way to contact the server before throwing a ServerUnavailableException() exception.{quote}
Yes, Cosmos DB closing idle connections (while apparently ignoring WebSocket keep-alive pings) is a good reason for us to try to replace closed connections before throwing an exception. This can of course also help in other scenarios where the server was just temporarily unreachable but can be reached again when the request comes in. So this is something we should implement in general and not just for this case of Cosmos DB closing idle connections.

I see three alternatives to implement this:

*Option 1*: {{EnsurePoolIsPopulatedAsync()}} iterates over all connections to check whether they are open and replaces closed connections.
* Advantage: Logic to populate the pool is kept in one place
* Disadvantage: Makes each request more expensive as the driver has to iterate over all connections

*Option 2*: {{TryGetAvailableConnection()}} replaces closed connections directly.
* Advantage: Time between check and usage is very short -> race condition unlikely
* Disadvantage: Slows down requests even if another connection is available for the request

*Option 3*: A background task regularly checks all connections and replaces closed ones.
* Advantage: Latency of requests is (basically) unaffected
* Advantage: Connections can be replaced whenever the server is available again
* Advantage: We can potentially move the population of the pool and the removal of closed connections out of the normal request processing completely into a background task
* Disadvantage: Highest complexity
* Disadvantage: There is a window between the closing of a connection and the next check in which requests will still fail. This is however not much of a problem for an idle connection, as we can probably replace the dead connection before new requests arrive.

Option 1 is the one you suggested, [~jmondal], if I understood it correctly. Overall, however, I tend toward option 3 as it doesn't impact the latency of requests and because it allows us to move the pool resizing operations out of the usual request processing, which also means that we can prevent a situation where the pool is sized down and up at the same time. [~jorgebg] already [suggested to use a task scheduler|https://github.com/apache/tinkerpop/pull/1077#issuecomment-469640764] for this reason in the PR that introduced round-robin scheduling of connections. (A rough sketch of such a background task is also at the end of this comment.)

What do others think about this?
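To make the catch-and-recreate workaround above a bit more concrete, here is a rough sketch of what it could look like on the application side. This is only an illustration, not something the driver provides: it assumes the failure surfaces as the {{ServerUnavailableException}} from the ticket title, uses a fixed number of attempts, and ignores thread safety for brevity.

{code:csharp}
using System;
using System.Threading.Tasks;
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Exceptions;

// Illustrative wrapper: retries a query and, if the pool looks dead,
// disposes the GremlinClient and creates a fresh one (= fresh connections).
public class RecreatingGremlinClient : IDisposable
{
    private readonly Func<GremlinClient> _clientFactory;
    private GremlinClient _client;

    public RecreatingGremlinClient(Func<GremlinClient> clientFactory)
    {
        _clientFactory = clientFactory;
        _client = clientFactory();
    }

    public async Task<ResultSet<dynamic>> SubmitAsync(string query, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await _client.SubmitAsync<dynamic>(query).ConfigureAwait(false);
            }
            catch (ServerUnavailableException) when (attempt < maxAttempts)
            {
                // The pooled connections are apparently dead (e.g. closed by
                // Cosmos DB while idle): throw the client away and retry with
                // a brand-new client and therefore brand-new connections.
                _client.Dispose();
                _client = _clientFactory();
            }
        }
    }

    public void Dispose() => _client.Dispose();
}
{code}

Usage would then be something like {{var results = await wrappedClient.SubmitAsync("g.V().count()");}} with the factory creating the {{GremlinClient}} for Cosmos DB.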
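And to illustrate what I mean with option 3, a very rough sketch of a pool whose population is owned by a background task. All type and member names here are made up for the example and don't match the actual driver internals; locking, back-off, and error handling are simplified.

{code:csharp}
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical connection abstraction, only for this sketch; the real
// Connection/ConnectionPool types in Gremlin.Net look different.
public interface IPoolConnection : IDisposable
{
    bool IsOpen { get; }
}

public class BackgroundMaintainedPool : IDisposable
{
    private readonly object _lock = new object();
    private readonly List<IPoolConnection> _connections = new List<IPoolConnection>();
    private readonly Func<IPoolConnection> _openConnection;
    private readonly int _poolSize;
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();

    public BackgroundMaintainedPool(Func<IPoolConnection> openConnection, int poolSize,
                                    TimeSpan checkInterval)
    {
        _openConnection = openConnection;
        _poolSize = poolSize;
        // The background task owns population of the pool and replacement of closed
        // connections, so the request path never has to do either.
        Task.Run(() => MaintainPoolAsync(checkInterval, _cts.Token));
    }

    private async Task MaintainPoolAsync(TimeSpan interval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            lock (_lock)
            {
                // Drop connections that the server has closed (e.g. idle connections
                // closed by Cosmos DB)...
                foreach (var dead in _connections.Where(c => !c.IsOpen).ToList())
                {
                    dead.Dispose();
                    _connections.Remove(dead);
                }
                // ...and top the pool back up to its configured size.
                while (_connections.Count < _poolSize)
                    _connections.Add(_openConnection());
            }
            try { await Task.Delay(interval, ct).ConfigureAwait(false); }
            catch (TaskCanceledException) { return; }
        }
    }

    public void Dispose() => _cts.Cancel();
}
{code}

With something like this in place, {{TryGetAvailableConnection()}} would only have to pick an open connection and could leave all resizing and replacement to the background task.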

> Get ConnectionPoolBusyException and then ServerUnavailableExceptions
> --------------------------------------------------------------------
>
>                 Key: TINKERPOP-2288
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2288
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: dotnet
>    Affects Versions: 3.4.3
>        Environment: Gremlin.Net 3.4.3
>                     Microsoft.NetCore.App 2.2
>                     Azure Cosmos DB
>           Reporter: patrice huot
>           Priority: Critical
>
> I am using the .Net Core Gremlin API to query Cosmos DB.
> From time to time we are getting an error saying that no connection is available and then the server becomes unavailable. When this is occurring, we need to restart the server. It looks like the connections are not released properly and become unavailable forever.
> We have configured the pool size to 50 and the MaxInProcessPerConnection to 32 (which I guess should be sufficient).
> To diagnose the issue, is there a way to access diagnostic information on the connection pool in order to know how many connections are open and how many processes are running in each connection?
> I would like to be able to monitor the connection usage to see if the connections are about to be exhausted and to see if the number of used connections is always increasing or if the connection lease is released when the queries complete.
> As a work around, is there a way we can access this information from the code so that I can catch those scenarios and create logic that re-initiates the connection pool?
>
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)