[ 
https://issues.apache.org/jira/browse/GEODE-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523945#comment-17523945
 ] 

Patrick Johnsn commented on GEODE-9880:
---------------------------------------

Some details to reproduce:

1. GemfirePoolManager creates an org.apache.geode.cache.client.PoolFactory.
2. GemfirePoolManager sets the locators by calling poolFactory.addLocator(). 
This sets the IP ADDRESSES of the locators.
3. Then, among other things, the Geode cache instance internally and 
repeatedly sends a query to the configured locators to dynamically update the 
list of locators:
3.1 Before the request is actually sent to a locator, its hostname is looked 
up from the IP address and stored as part of the configured locators list.
3.2 When the request-response completes, the response is the list of 
currently active locators, but it includes only HOST NAMES.
3.3 The configured list of locators is merged with the currently active 
locators obtained from the response.
Then 3.1 is repeated.

The merging done in step 3.3 extends the initially configured locators list 
with the entries from the response, but skips any entry that matches an 
existing one BASED ON THE HOSTNAME.

If 3.1 was done before the request was sent, the merging step has no 
practical effect, because the hostname in the initial list and the hostname in 
the response match.
This is the happy path that happens naturally when just one locator is 
configured, i.e. for a non-HA setup.

But for an HA setup with two locators, the 3.1-3.3 flow completes with the 
first locator while the second locator's entry still contains only the IP 
ADDRESS, since step 3.1 has not been done yet for that entry.

At this point, the first locator's response contains the HOSTNAME of the 
second locator, and this name is added to the configured locators list without 
recognizing that the locator is already in the list.
So we end up with two entries for the second locator: one with the IP address 
only and another with the hostname only.
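
The duplicate can be illustrated with a small, self-contained sketch. This is 
not Geode's code: the Entry class, the merge() method, and the host names and 
IP addresses are hypothetical stand-ins that only mimic the behaviour described 
above (merge by hostname, while one entry still has no hostname).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocatorMergeSketch {
    /** Hypothetical stand-in for one configured locator entry; not a Geode class. */
    static final class Entry {
        final String hostName;  // null until step 3.1 has run for this entry
        final String ipAddress; // null for entries learned from a locator response

        Entry(String hostName, String ipAddress) {
            this.hostName = hostName;
            this.ipAddress = ipAddress;
        }

        @Override
        public String toString() {
            return "Entry(host=" + hostName + ", ip=" + ipAddress + ")";
        }
    }

    /** Step 3.3 as described: add an active host name unless an entry already has that host name. */
    static List<Entry> merge(List<Entry> configured, List<String> activeHostNames) {
        List<Entry> merged = new ArrayList<>(configured);
        for (String name : activeHostNames) {
            boolean known = false;
            for (Entry e : merged) {
                if (name.equals(e.hostName)) {
                    known = true;
                    break;
                }
            }
            if (!known) {
                merged.add(new Entry(name, null)); // hostname-only entry
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Entry> configured = new ArrayList<>();
        configured.add(new Entry("locator-1", "10.0.0.1")); // step 3.1 already done
        configured.add(new Entry(null, "10.0.0.2"));        // still IP address only

        // Step 3.2: the response of the first locator lists host names only.
        List<String> response = Arrays.asList("locator-1", "locator-2");

        // The second locator now appears twice: once by IP, once by hostname.
        for (Entry e : merge(configured, response)) {
            System.out.println(e);
        }
    }
}
```

In this sketch the merged list has three entries for two locators, because the 
IP-only entry can never match a hostname from the response.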

After this, whenever a locator needs to be contacted, the entry with only the 
hostname is, unluckily, picked as the first preferred locator. This time the 
IP address is not auto-generated for the entry; instead, a null reference of 
type InetAddress is obtained and then used by code that appeared in 1.12.5 to 
do a stricter TLS handshake. This is where we get the null pointer exception, 
and all communication with the locators gets blocked. The other configured 
locators are no longer contacted because of the exception.
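
The null InetAddress can be reproduced with plain JDK classes. The sketch 
below is an illustration only, not the Geode code path: the class name, the 
hostNameFor() helper, the host name "locator-2" and the port are assumptions 
chosen for the example. It relies on the documented JDK behaviour that an 
unresolved InetSocketAddress returns null from getAddress(), and shows one 
possible null-safe guard, which is not necessarily the fix adopted in Geode.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class UnresolvedAddressSketch {
    /** Hypothetical null-safe helper: choose a host name to pass to a TLS handshake. */
    static String hostNameFor(InetSocketAddress addr) {
        InetAddress inet = addr.getAddress(); // null when the address is unresolved
        return inet != null ? inet.getHostName() : addr.getHostString();
    }

    public static void main(String[] args) {
        // createUnresolved() performs no name lookup, so getAddress() returns null.
        // This stands in for the hostname-only locator entry described above.
        InetSocketAddress hostOnly = InetSocketAddress.createUnresolved("locator-2", 10334);
        InetAddress inetadd = hostOnly.getAddress();
        System.out.println("inetadd is null: " + (inetadd == null));

        try {
            // Same shape as the unguarded inetadd.getHostName() call quoted in the issue.
            inetadd.getHostName();
        } catch (NullPointerException e) {
            System.out.println("NullPointerException on the unguarded call");
        }

        // The guarded variant falls back to the configured host string.
        System.out.println("guarded host name: " + hostNameFor(hostOnly));
    }
}
```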

The good news is that, probably because of some configuration, the hostname 
obtained from the IP address is not strictly involved in the TLS SNI process. 
Otherwise this would be a problem, since the certificates that we have 
configured have the common name (CN) set to a fixed value such as 
"common-name" and have no subject alternative name (SAN).

> Cluster with multiple locators in an environment with no host name 
> resolution, leads to null pointer exception
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-9880
>                 URL: https://issues.apache.org/jira/browse/GEODE-9880
>             Project: Geode
>          Issue Type: Bug
>          Components: locator, membership
>    Affects Versions: 1.12.5
>            Reporter: Tigran Ghahramanyan
>            Assignee: Patrick Johnsn
>            Priority: Major
>              Labels: blocks-1.12.10, blocks-1.15.0, membership
>
> In our use case we have two locators that are initially configured with IP 
> addresses, but _AutoConnectionSourceImpl.UpdateLocatorList()_ flow keeps on 
> adding their corresponding host names to the locators list, while these host 
> names are not resolvable.
> Later in _AutoConnectionSourceImpl.queryLocators()_, whenever a client 
> tries to use such non resolvable host name to connect to a locator it tries 
> to establish a connection to _socketaddr=0.0.0.0_, as written in 
> _SocketCreator.connect()_. Which seems strange.
> Then, if there is no locator running on the same host, the next locator in 
> the list is contacted, until reaching a locator contact configured with IP 
> address - which succeeds eventually.
> But, when there happens to be a locator listening on the same host, then we 
> have a null pointer exception in the second line below, because _inetadd=null_:
> socket.connect(sockaddr, Math.max(timeout, 0)); // sockaddr=0.0.0.0, connects to a locator listening on the same host
> configureClientSSLSocket(socket, inetadd.getHostName(), timeout); // inetadd = null
>  
> As a result, the cluster comes to a failed state, unable to recover.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
