Will do, Mark. Thanks!

On Sun, Mar 30, 2014 at 1:29 PM, Mark Miller <markrmil...@gmail.com> wrote:

> We don't currently retry, but I don't think it would hurt much if we did -
> at least briefly.
>
> If you want to file a JIRA issue, that would be the best way to get it in
> a future release.
>
> --
> Mark Miller
> about.me/markrmiller
>
> On March 28, 2014 at 5:40:47 PM, Michael Della Bitta
> (michael.della.bi...@appinions.com) wrote:
>
> Hi, Jessica,
>
> We've had a similar problem when DNS resolution of our Hadoop task nodes
> has failed. They tend to take a dirt nap until you fix the problem
> manually. Are you experiencing this in AWS as well?
>
> I'd say the two things to do are to poll the node state via HTTP using a
> monitoring tool so you get an immediate notification of the problem, and to
> install some sort of caching server like nscd if you expect to have DNS
> resolution failures regularly.
>
> Michael Della Bitta
> Applications Developer
> appinions inc.
>
> On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:
>
> > Hi,
> >
> > First off, I'd like to give a disclaimer that this probably is a very edge
> > case issue. However, since it happened to us, I would like to get some
> > advice on how to best handle this failure scenario.
> >
> > Basically, we had some network issue where we temporarily lost connection
> > and DNS. The zookeeper client properly triggered the watcher. However, when
> > trying to reconnect, this following Exception is thrown:
> >
> > 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
> >     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> >     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
> >     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
> >     at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
> >     at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> >     at java.net.InetAddress.getAllByName(InetAddress.java:1063)
> >     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> >     at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
> >     at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> >     at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> >
> > I tried to look at the code and it seems that there'd be no further retries
> > to connect to Zookeeper, and the node is basically left in a bad state and
> > will not recover on its own. (Please correct me if I'm reading this wrong.)
> > Thinking about it, this is probably fair, since normally you wouldn't
> > expect retries to fix an "unknown host" issue--even though in our case it
> > would have--but I'm wondering what we should do to handle this situation if
> > it happens again in the future.
> >
> > Any advice is appreciated.
> >
> > Thanks,
> > Jessica
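
For reference, here is a minimal sketch of the kind of brief retry discussed in this thread: resolve the host a bounded number of times with a doubling delay before giving up. It is only an illustration and not how Solr's DefaultConnectionStrategy behaves today. The names DnsRetry, Resolver, and resolveWithRetry are hypothetical and are not part of Solr or ZooKeeper; the attempt count and delay are arbitrary.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsRetry {

    /** Hypothetical hook so the lookup can be swapped out; not a Solr or ZooKeeper API. */
    @FunctionalInterface
    public interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries to resolve the host up to maxAttempts times, doubling the delay
     * after each failure. Rethrows the last UnknownHostException if every
     * attempt fails.
     */
    public static InetAddress[] resolveWithRetry(String host, int maxAttempts,
                                                 long initialDelayMs, Resolver resolver)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Uses the JDK resolver; "localhost" is just a placeholder host.
        InetAddress[] addrs =
                resolveWithRetry("localhost", 3, 100L, InetAddress::getAllByName);
        System.out.println("Resolved " + addrs.length + " address(es)");
    }
}
```

A reconnect path could call something like resolveWithRetry before constructing the ZooKeeper client, so a short DNS outage does not leave the node stuck. Whether that belongs in Solr itself is a question for the JIRA issue.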