Will do, Mark. Thanks!

On Sun, Mar 30, 2014 at 1:29 PM, Mark Miller <markrmil...@gmail.com> wrote:

> We don't currently retry, but I don't think it would hurt much if we did -
> at least briefly.
>
> If you want to file a JIRA issue, that would be the best way to get it in
> a future release.
>
> --
> Mark Miller
> about.me/markrmiller
>
> On March 28, 2014 at 5:40:47 PM, Michael Della Bitta (michael.della.bi...@appinions.com) wrote:
>
> Hi, Jessica,
>
> We've had a similar problem when DNS resolution of our Hadoop task nodes
> has failed. They tend to take a dirt nap until you fix the problem
> manually. Are you experiencing this in AWS as well?
>
> I'd say the two things to do are to poll the node state via HTTP using a
> monitoring tool so you get an immediate notification of the problem, and to
> install some sort of caching server like nscd if you expect to have DNS
> resolution failures regularly.
>
>
>
> Michael Della Bitta
>
> On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:
>
> > Hi,
> >
> > First off, I'd like to give a disclaimer that this is probably very much
> > an edge case. However, since it happened to us, I would like to get some
> > advice on how best to handle this failure scenario.
> >
> > Basically, we had a network issue where we temporarily lost connection
> > and DNS. The ZooKeeper client properly triggered the watcher. However,
> > when trying to reconnect, the following exception was thrown:
> >
> > 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
> >     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> >     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
> >     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
> >     at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
> >     at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> >     at java.net.InetAddress.getAllByName(InetAddress.java:1063)
> >     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> >     at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
> >     at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> >     at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> >
> > I tried to look at the code, and it seems that there would be no further
> > retries to connect to ZooKeeper, so the node is basically left in a bad
> > state and will not recover on its own. (Please correct me if I'm reading
> > this wrong.) Thinking about it, this is probably fair, since normally you
> > wouldn't expect retries to fix an "unknown host" issue--even though in our
> > case it would have--but I'm wondering what we should do to handle this
> > situation if it happens again in the future.
> >
> > Any advice is appreciated.
> >
> > Thanks,
> > Jessica
> >
>
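
For anyone who finds this thread later: the brief retry Mark describes could be sketched roughly as below. This is only an illustration of the idea, not Solr's or ZooKeeper's actual reconnect code. The class name DnsRetry, the Resolver interface, and the resolveWithRetry method with its parameters are all hypothetical names made up for this example. The only real library call is InetAddress.getAllByName, which is the lookup that fails in the stack trace above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class DnsRetry {

    /** Hypothetical hook around the lookup so it can be replaced in tests. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries the lookup up to maxAttempts times, sleeping between attempts
     * and doubling the delay each time. Rethrows the last
     * UnknownHostException if every attempt fails.
     */
    static InetAddress[] resolveWithRetry(Resolver resolver, String host,
                                          int maxAttempts, long initialDelayMs)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Back off before the next attempt.
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // "localhost" is only a placeholder host for the demonstration.
        InetAddress[] addrs =
                resolveWithRetry(InetAddress::getAllByName, "localhost", 5, 200);
        System.out.println(addrs.length + " address(es) resolved");
    }
}
```

In a real reconnect path, the retried operation would presumably be the construction of the ZooKeeper client rather than a bare lookup. Note also that the JVM caches failed lookups for a short time (the networkaddress.cache.negative.ttl security property, 10 seconds by default), so retries spaced more closely than that may just return the cached failure.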
