[ 
https://issues.apache.org/jira/browse/SOLR-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157644#comment-15157644
 ] 

Dennis Gove commented on SOLR-8599:
-----------------------------------

I have somewhat of an interesting situation at hand here.

As part of this patch a test is added to ConnectionManagerTest which forces a 
DNS failure on the zookeeper connection by attempting to connect to 
"BADADDRESS" and then fixing it after 5 seconds. This shows that the change 
Keith put in ConnectionManager will continually try to make a connection until 
it can. It's a good test and it exercises the bug and fix perfectly.
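
The "keep trying until it can connect" behaviour can be sketched roughly as 
below. This is only an illustration, not the code in the patch: the class 
RetryUntilConnected, the retry method and its maxAttempts/delayMs parameters 
are hypothetical names, and the connection attempt is passed in as a plain 
Callable rather than a real SolrZooKeeper constructor call.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Solr: retries a connection attempt. */
final class RetryUntilConnected {

    /**
     * Calls connect until it returns without throwing, sleeping delayMs
     * between failed attempts. Rethrows the last failure once maxAttempts
     * attempts have been made.
     */
    public static <T> T retry(Callable<T> connect, int maxAttempts, long delayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                // e.g. an UnknownHostException while the address is still bad
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                }
            }
        }
        throw last;
    }
}
```

A test in the spirit of the one described above would hand retry a Callable 
that throws UnknownHostException for the first few calls and then succeeds.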

However, whether the test passes depends on my ISP. I've run the test under 5 
scenarios and it passes in only 3 of them. 

1. Connected to my corporate network
In this scenario the test passes perfectly as it should.

2. Connected to no network (i.e., wifi card turned off)
In this scenario the test passes perfectly as it should.

3. Connected to my home network backed by Verizon FIOS
In this scenario the test hangs. On further investigation I found that it is 
in an "infinite" loop in ConnectionManager::waitForConnected. The loop is 
effectively infinite because, while there is a timeout, the timeout is 
Long.MAX_VALUE. The loop waits until it is either connected or closed, and 
neither of those conditions is ever hit. But why? We're trying to hit 
http://BADADDRESS and clearly that is a DNS lookup failure. Oh no no no, not 
according to Verizon. See, Verizon instead says "Oh, you must've typed 
something in wrong so instead of returning to you a DNS failure let me return 
to you a redirect to a search page - you clearly want this search page". It 
appears that because of this redirection a connection is never made, nor is 
it ever closed. Hence, loop forever. 

4. Connected to my personal wifi hotspot backed by T-Mobile
Same issue as seen with Verizon FIOS, though a T-Mobile specific search page. 

5. Connected to a hotspot through my iPhone backed by Verizon Wireless
In this scenario the test passes perfectly as it should.
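
The hang in scenarios 3 and 4 comes from waiting on "connected or closed" 
with a timeout that is effectively unbounded. A wait with a finite deadline 
could look roughly like the sketch below. This is not the actual 
ConnectionManager code: the class BoundedWait and its markConnected and 
markClosed methods are hypothetical, and only the name waitForConnected is 
borrowed from the method mentioned above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, not Solr's ConnectionManager. */
final class BoundedWait {
    private boolean connected;
    private boolean closed;

    public synchronized void markConnected() {
        connected = true;
        notifyAll();
    }

    public synchronized void markClosed() {
        closed = true;
        notifyAll();
    }

    /**
     * Blocks until connected or closed. Returns true if connected, false if
     * closed, and throws TimeoutException if neither happens within
     * timeoutMs, so the caller cannot hang forever.
     */
    public synchronized boolean waitForConnected(long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!connected && !closed) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException(
                        "not connected or closed after " + timeoutMs + " ms");
            }
            // wait(0) would block indefinitely, so always wait at least 1 ms
            wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
        }
        return connected;
    }
}
```

With a bound like this, the redirect case would surface as a timeout failure 
in the test rather than as a hang.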

Note that this difference is *only* seen when a DNS lookup failure is in play. 
If I change the bad address to "http://BADADDRESS" then the connection attempt 
fails instead because "//BADADDRESSIS" is said to be an invalid path string. 
Technically this is testing a slightly different case, but I'm comfortable 
calling it the same test because the issue being corrected is a failure to 
make a connection during the construction of SolrZooKeeper, and a malformed 
url fails just the same.
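
Since the outcome hinges on whether the local resolver actually reports a 
lookup failure for the bad address, a test could also probe for that up 
front. The sketch below is one possible way to do so and is not part of the 
patch: DnsFailureProbe, Resolver and lookupFails are hypothetical names, and 
the resolver is injectable so the check can be exercised without a network. 
The only real API used is java.net.InetAddress.getAllByName.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical probe, not part of Solr or the patch. */
final class DnsFailureProbe {

    /** Minimal resolver abstraction so the lookup can be stubbed. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /** Returns true if resolving host fails with UnknownHostException. */
    public static boolean lookupFails(String host, Resolver resolver) {
        try {
            resolver.resolve(host);
            return false; // the resolver returned some address for the name
        } catch (UnknownHostException e) {
            return true;
        }
    }

    /** Same check against the JVM's normal name resolution. */
    public static boolean lookupFails(String host) {
        return lookupFails(host, InetAddress::getAllByName);
    }
}
```

On a network where unknown names are redirected, lookupFails("BADADDRESS") 
would be expected to return false, and a DNS-dependent test could be skipped 
in that case.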


> Errors in construction of SolrZooKeeper cause Solr to go into an inconsistent 
> state
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-8599
>                 URL: https://issues.apache.org/jira/browse/SOLR-8599
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Keith Laban
>         Attachments: SOLR-8599.patch, SOLR-8599.patch
>
>
> We originally saw this happen due to a DNS exception (see stack trace below). 
> However, any exception thrown in the constructor of SolrZooKeeper or the 
> parent class, ZooKeeper, will cause DefaultConnectionStrategy to fail to 
> update the zookeeper client. Once it gets into this state, it will not try to 
> connect again until the process is restarted. The node itself will also 
> respond successfully to query requests, but not to update requests.
> Two things should be addressed here:
> 1) Fix the error handling and issue some number of retries.
> 2) If we are stuck in a state like this, stop responding to all requests.
> {code}
> 2016-01-23 13:49:20.222 ERROR ConnectionManager [main-EventThread] - :java.net.UnknownHostException: HOSTNAME: unknown error
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
> at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:132)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-01-23 13:49:20.222 INFO ConnectionManager [main-EventThread] - Connected:false
> 2016-01-23 13:49:20.222 INFO ClientCnxn [main-EventThread] - EventThread shut down
> {code}


