We don’t currently retry, but I don’t think it would hurt much if we did - at 
least briefly.
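
For illustration only, a brief retry could look something like the sketch below. This is not Solr's actual reconnect code: the class name BriefRetry, the method retryOnUnknownHost, and the attempt and delay parameters are all hypothetical. It retries an action only when it fails with java.net.UnknownHostException, doubling the wait between attempts, and gives up after a fixed number of tries.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class BriefRetry {
    private BriefRetry() {}

    /**
     * Runs the action, retrying only when it fails with UnknownHostException.
     * Any other exception propagates immediately. The delay doubles after
     * each failed attempt; the last UnknownHostException is rethrown once
     * maxAttempts is reached.
     */
    static <T> T retryOnUnknownHost(Callable<T> action, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (UnknownHostException e) {
                if (attempt >= maxAttempts) {
                    throw e; // DNS still failing: give up and surface the error
                }
                Thread.sleep(delayMs); // brief pause before the next lookup
                delayMs *= 2;
            }
        }
    }
}
```

The reconnect path could then wrap the client construction, for example BriefRetry.retryOnUnknownHost(() -> new ZooKeeper(connectString, sessionTimeoutMs, watcher), 5, 200), where the values 5 and 200 are arbitrary placeholders. Keeping the retry window short avoids blocking the ZooKeeper event thread for long when the host name really is wrong.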

If you want to file a JIRA issue, that would be the best way to get it in a 
future release.

-- 
Mark Miller
about.me/markrmiller

On March 28, 2014 at 5:40:47 PM, Michael Della Bitta 
(michael.della.bi...@appinions.com) wrote:

Hi, Jessica,  

We've had a similar problem when DNS resolution of our Hadoop task nodes  
has failed. They tend to take a dirt nap until you fix the problem  
manually. Are you experiencing this in AWS as well?  

I'd say the two things to do are to poll the node state via HTTP using a  
monitoring tool so you get an immediate notification of the problem, and to  
install some sort of caching server like nscd if you expect to have DNS  
resolution failures regularly.  
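
As a rough sketch of the polling idea (a real monitoring tool would normally do this for you), something like the following could serve as the check. The class name NodeProbe and the method isReachable are made up for illustration, and which URL actually reflects node state is an assumption you would have to fill in for your own deployment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical probe, not part of Solr; the URL to poll is deployment-specific.
final class NodeProbe {
    private NodeProbe() {}

    /**
     * Returns true only if a GET to the URL completes with a 2xx status
     * within the timeout. DNS failures, refused connections, timeouts, and
     * malformed URLs all count as "not reachable".
     */
    static boolean isReachable(String url, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            // UnknownHostException and MalformedURLException are IOExceptions too
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

A scheduler would call isReachable every few seconds and alert after a couple of consecutive failures, so that a single slow response does not page anyone.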



Michael Della Bitta  

Applications Developer  

o: +1 646 532 3062  

appinions inc.  

"The Science of Influence Marketing"  

18 East 41st Street  

New York, NY 10017  

t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>  


On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:

> Hi,  
>  
> First off, I'd like to give a disclaimer that this probably is a very edge  
> case issue. However, since it happened to us, I would like to get some  
> advice on how to best handle this failure scenario.  
>  
> Basically, we had a network issue where we temporarily lost connectivity
> and DNS. The ZooKeeper client properly triggered the watcher. However, on
> the reconnect attempt, the following exception was thrown:
>  
> 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
> at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getAllByName(InetAddress.java:1063)
> at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
> at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>  
> I tried to look at the code and it seems that there'd be no further retries  
> to connect to Zookeeper, and the node is basically left in a bad state and  
> will not recover on its own. (Please correct me if I'm reading this wrong.)  
> Thinking about it, this is probably fair, since normally you wouldn't  
> expect retries to fix an "unknown host" issue--even though in our case it  
> would have--but I'm wondering what we should do to handle this situation if  
> it happens again in the future.  
>  
> Any advice is appreciated.  
>  
> Thanks,  
> Jessica  
>  
