[
https://issues.apache.org/jira/browse/GEODE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Donal Evans reassigned GEODE-9808:
----------------------------------
Assignee: Donal Evans
> Client ops fail with NoLocatorsAvailableException when all servers leave the
> DS
> --------------------------------------------------------------------------------
>
> Key: GEODE-9808
> URL: https://issues.apache.org/jira/browse/GEODE-9808
> Project: Geode
> Issue Type: Bug
> Components: client/server
> Affects Versions: 1.15.0
> Reporter: Bill Burcham
> Assignee: Donal Evans
> Priority: Major
> Labels: needsTriage
>
> When there are no cache servers (only locators) in a cluster, client
> operations will fail with a misleading exception:
> {noformat}
> org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
>     at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
>     at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
>     at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
>     at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
>     at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
>     at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
> {noformat}
> Even though the client is able to connect to a locator, we encounter a
> NoAvailableLocatorsException with the message "Unable to connect to
> any locators in the list".
> Investigating the product code, we see:
> # If there are no cache servers in the cluster, ServerLocator.pickServer()
> always constructs a ClientConnectionResponse(null), which causes that
> object’s hasResult() to return false in the loop termination check in
> AutoConnectionSourceImpl.queryLocators().
> # The exception wording is misleading not only in
> AutoConnectionSourceImpl.findServer() but also in at least two other call
> sites in AutoConnectionSourceImpl: findReplacementServer() and
> findServersForQueue().
> # In each of those cases the calling method translates a null response from
> queryLocators() into a throw of a NoAvailableLocatorsException.
> # An appropriate exception, NoAvailableServersException, already exists for
> the case where we were able to contact a locator but the locator was not able
> to find any cache servers.
> # According to my Git spelunking, queryLocators() has been obscuring the
> true cause of the failure since at least 2015.
> Without analyzing ServerLocator.pickServer()
> (LocatorLoadSnapshot.getServerForConnection()) to discern why two locators
> might disagree on how many cache servers are in the cluster, it seems to me
> that we should modify AutoConnectionSourceImpl.queryLocators() so that:
> * if it gets a ServerLocationResponse with hasResult() true, it immediately
> returns that as it does now
> * otherwise it keeps trying and it keeps track of the last (non-null)
> ServerLocationResponse it has received
> * it returns the last non-null ServerLocationResponse it received (otherwise
> it returns null)
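
The three rules above can be sketched as a small loop. This is only an illustration of the proposed behavior, not the Geode implementation: Response is a hypothetical stand-in for ServerLocationResponse that models nothing but hasResult(), and askLocator is an assumed hook that returns null when a locator cannot be reached.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for Geode's ServerLocationResponse; only hasResult() is modelled.
interface Response {
  boolean hasResult();
}

final class QueryLocatorsSketch {
  private QueryLocatorsSketch() {}

  /**
   * Asks each locator in turn. askLocator is an assumed hook that returns null
   * when the locator could not be reached.
   *
   * Returns the first response with a result, otherwise the last non-null
   * response received, otherwise null (no locator responded at all).
   */
  static <L> Response queryLocators(List<L> locators, Function<L, Response> askLocator) {
    Response lastResponse = null;
    for (L locator : locators) {
      Response response = askLocator.apply(locator);
      if (response == null) {
        continue; // locator unreachable: keep trying the others
      }
      if (response.hasResult()) {
        return response; // same early return as the current behavior
      }
      lastResponse = response; // locator answered, but knows of no servers
    }
    return lastResponse;
  }
}
```

With this shape a caller can tell "no locator answered" (null) apart from "a locator answered but had no servers" (non-null, hasResult() false).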
> With that in hand, we can change the three call locations in
> AutoConnectionSourceImpl: findServer(), findReplacementServer(), and
> findServersForQueue() to each throw NoAvailableLocatorsException if no
> locator responded, or NoAvailableServersException if a locator responded with
> a ClientConnectionResponse for which hasResult() returns false.
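
A minimal sketch of how one call site might translate the locator reply into the two exceptions. Everything here is an assumption made for illustration: the two exception classes are local stand-ins named after the Geode ones so the example compiles on its own, LocatorReply and SimpleReply are hypothetical types, and the NoAvailableServersException message text is invented.

```java
import java.util.List;

// Local stand-ins named after the Geode exceptions so the sketch compiles
// without Geode on the classpath.
class NoAvailableLocatorsException extends RuntimeException {
  NoAvailableLocatorsException(String message) {
    super(message);
  }
}

class NoAvailableServersException extends RuntimeException {
  NoAvailableServersException(String message) {
    super(message);
  }
}

// Hypothetical stand-in for the reply returned by queryLocators().
interface LocatorReply {
  boolean hasResult();

  String server();
}

// Mirrors ClientConnectionResponse(null): a null server means "no result".
final class SimpleReply implements LocatorReply {
  private final String server;

  SimpleReply(String server) {
    this.server = server;
  }

  @Override
  public boolean hasResult() {
    return server != null;
  }

  @Override
  public String server() {
    return server;
  }
}

final class FindServerSketch {
  private FindServerSketch() {}

  static String findServer(LocatorReply reply, List<String> locators) {
    if (reply == null) {
      // No locator responded at all.
      throw new NoAvailableLocatorsException(
          "Unable to connect to any locators in the list " + locators);
    }
    if (!reply.hasResult()) {
      // A locator responded but reported no cache servers (illustrative message).
      throw new NoAvailableServersException(
          "Locators responded but reported no available cache servers");
    }
    return reply.server();
  }
}
```

The same two checks would apply in findReplacementServer() and findServersForQueue().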