[ https://issues.apache.org/jira/browse/GEODE-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523945#comment-17523945 ]
Patrick Johnsn commented on GEODE-9880:
---------------------------------------

Some details to reproduce:

1. GemfirePoolManager creates org.apache.geode.cache.client.PoolFactory
2. GemfirePoolManager sets the locators by calling poolFactory.addLocator(). This sets the IP ADDRESSES of the locators.
3. Then, among other things, the Geode cache instance internally repeats a process that queries the configured locators to dynamically update the list of locators.
3.1 Before the request is actually sent to a locator, its hostname is looked up from the IP address and stored as part of the configured locators list.
3.2 When the request-response completes, the response is the list of currently active locators, but it includes only HOST NAMES.
3.3 The configured list of locators is merged with the currently active locators received in the response. Then 3.1 is repeated.

The merge in step 3.3 tries to extend the initially configured locators list, but skips matches BASED ON THE HOSTNAME. If 3.1 was done before the request was sent, the merge has no practical effect, because the hostname in the initial list matches the one in the response. This is the happy path, which happens naturally when just one locator is configured, i.e. for a non-HA setup.

But in an HA setup with two locators, the 3.1-3.3 flow completes with the first locator while the second locator's configuration still contains only the IP ADDRESS, since step 3.1 has not yet run for that entry. At this point the first locator's response contains the HOSTNAME of the second locator, and this name is added to the configured locators list without recognizing that the locator is already in the list. So we end up with two entries for the second locator: one with the IP address only and another with the hostname only.
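
The merge behaviour described in steps 3.1-3.3 can be illustrated with a minimal sketch. This is not Geode's code: the class LocatorMergeSketch, the Entry type, the merge() and demo() helpers, and the addresses and host names are all hypothetical and only model the flow as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical model of the locator list merge; not Geode's actual classes.
class LocatorMergeSketch {

    // One locator entry: either field may be unknown (null).
    static final class Entry {
        final String ip;   // null when only the host name is known
        String hostName;   // null until the reverse lookup (step 3.1) has run

        Entry(String ip, String hostName) {
            this.ip = ip;
            this.hostName = hostName;
        }

        @Override
        public String toString() {
            return "Entry(ip=" + ip + ", host=" + hostName + ")";
        }
    }

    // Step 3.3 as described: append every active locator whose HOST NAME
    // is not already present in the configured list.
    static List<Entry> merge(List<Entry> configured, List<String> activeHostNames) {
        List<Entry> merged = new ArrayList<>(configured);
        for (String host : activeHostNames) {
            boolean known = false;
            for (Entry e : merged) {
                if (Objects.equals(e.hostName, host)) {
                    known = true;
                    break;
                }
            }
            if (!known) {
                merged.add(new Entry(null, host)); // host-name-only entry
            }
        }
        return merged;
    }

    // Reproduces the two-locator (HA) case with made-up values.
    static List<Entry> demo() {
        List<Entry> configured = new ArrayList<>();
        configured.add(new Entry("10.0.0.1", null)); // configured by IP only
        configured.add(new Entry("10.0.0.2", null)); // configured by IP only

        // Step 3.1 has run for the first locator only.
        configured.get(0).hostName = "locator-1";

        // Step 3.2: the response lists active locators by host name only.
        List<String> response = Arrays.asList("locator-1", "locator-2");

        return merge(configured, response);
    }

    public static void main(String[] args) {
        for (Entry e : demo()) {
            System.out.println(e);
        }
    }
}
```

In this sketch demo() returns three entries: the second locator appears once with only its IP address and once with only its host name, which is the duplication described above.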

After this, whenever a locator needs to be contacted, the hostname-only entry is unfortunately picked as the first preferred locator. This time the IP address is not auto-generated for the entry; instead, a null reference of type InetAddress is obtained and passed to code that appeared in 1.12.5 to do a stricter TLS handshake. This is where we get the null pointer exception, and all communication with the locators is blocked. The other configured locators are no longer contacted because of the exception.

The good news is that, probably because of some configuration, the hostname obtained from the IP address is not strictly involved in the TLS SNI process. Otherwise this would be a problem, since the certificates we have configured have the common name (CN) set to a fixed value such as "common-name" and there is no subject alternative name (SAN).

> Cluster with multiple locators in an environment with no host name
> resolution, leads to null pointer exception
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-9880
>                 URL: https://issues.apache.org/jira/browse/GEODE-9880
>             Project: Geode
>          Issue Type: Bug
>          Components: locator, membership
>   Affects Versions: 1.12.5
>            Reporter: Tigran Ghahramanyan
>            Assignee: Patrick Johnsn
>            Priority: Major
>              Labels: blocks-1.12.10, blocks-1.15.0, membership
>
> In our use case we have two locators that are initially configured with IP
> addresses, but _AutoConnectionSourceImpl.UpdateLocatorList()_ flow keeps on
> adding their corresponding host names to the locators list, while these host
> names are not resolvable.
> Later in {_}AutoConnectionSourceImpl.queryLocators(){_}, whenever a client
> tries to use such non resolvable host name to connect to a locator it tries
> to establish a connection to {_}socketaddr=0.0.0.0{_}, as written in
> {_}SocketCreator.connect(){_}. Which seems strange.
> Then, if there is no locator running on the same host, the next locator in
> the list is contacted, until reaching a locator contact configured with IP
> address - which succeeds eventually.
> But, when there happens to be a locator listening on the same host, then we
> have a null pointer exception in the second line below, because _inetadd=null_
> _socket.connect(sockaddr, Math.max(timeout, 0)); // sockaddr=0.0.0.0,
> connects to a locator listening on the same host_
> _configureClientSSLSocket(socket, inetadd.getHostName(), timeout); // inetadd
> = null_
>
> As a result, the cluster comes to a failed state, unable to recover.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)