[
https://issues.apache.org/jira/browse/CURATOR-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138245#comment-16138245
]
Ionuț G. Stan edited comment on CURATOR-229 at 8/23/17 11:37 AM:
-----------------------------------------------------------------
We've bumped into the same issue. Our DNS server was temporarily down and
Curator stopped retrying to connect because ZooKeeper threw a non-retryable
exception ({{UnknownHostException}}). In ZooKeeper >= 3.4.11 it throws an
{{IllegalArgumentException}} instead. This behaviour changed as a result of these:
- https://issues.apache.org/jira/browse/ZOOKEEPER-1576
- https://issues.apache.org/jira/browse/ZOOKEEPER-2614
What we ended up doing is registering a custom {{ZookeeperFactory}} with
{{CuratorFrameworkFactory.builder()}}. That factory is responsible for creating
new {{ZooKeeper}} instances when retrying, so we simply catch
{{UnknownHostException}} and {{IllegalArgumentException}} there and throw a
{{ConnectionLossException}} instead, which is retryable as far as Curator is
concerned.
In case anyone's interested, here's the code, in Scala:
{code:title=ZookeeperFactory.scala|borderStyle=solid}
import java.net.UnknownHostException

import com.typesafe.scalalogging.LazyLogging
import org.apache.zookeeper.KeeperException.ConnectionLossException
import org.apache.zookeeper.{Watcher, ZooKeeper}

/** ZooKeeper client factory that's resilient to hostname lookup errors.
  *
  * The purpose of this wrapper is to handle hostname errors encountered
  * while creating ZooKeeper client instances. It works around these issues:
  *
  *  - https://issues.apache.org/jira/browse/ZOOKEEPER-1576
  *  - https://issues.apache.org/jira/browse/ZOOKEEPER-2614
  *  - https://issues.apache.org/jira/browse/CURATOR-229
  *
  * Curator knows how to retry a finite and predefined set of exceptions. What
  * this custom factory does is map hostname-related exceptions to one that
  * Curator interprets as retryable, so Curator will keep trying to establish
  * a connection to ZooKeeper even in the face of such errors.
  *
  * @param servers The list of ZooKeeper hostnames or addresses.
  */
class ZookeeperFactory(servers: Seq[String])
  extends org.apache.curator.utils.ZookeeperFactory
  with LazyLogging {

  override def newZooKeeper(connectString: String, sessionTimeout: Int,
    watcher: Watcher, canBeReadOnly: Boolean): ZooKeeper = {
    def retry(servers: Seq[String]): ZooKeeper = {
      servers match {
        case Nil =>
          // All server hostnames have failed. Tell Curator to retry later.
          throw new ConnectionLossException()
        case remainingServers =>
          val connectString = remainingServers.mkString(",")
          try {
            new ZooKeeper(connectString, sessionTimeout, watcher, canBeReadOnly)
          } catch {
            // Apache ZooKeeper <= 3.4.10 throws an UnknownHostException at the
            // first hostname it can't resolve, instead of trying the following
            // hostnames in the list. So we drop the offending hostname from
            // the servers list and try again.
            case e: UnknownHostException =>
              logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
              retry(remainingServers.drop(1))
            // Apache ZooKeeper >= 3.4.11 tries all hostnames, but we still
            // want to retry if all of them fail right now.
            case EmptyHostProvider(e) =>
              logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
              throw new ConnectionLossException()
          }
      }
    }

    retry(servers)
  }
}

object EmptyHostProvider {
  private final val MESSAGE = "A HostProvider may not be empty!"

  def unapply(e: Throwable): Option[IllegalArgumentException] =
    e match {
      case e: IllegalArgumentException if e.getMessage == MESSAGE => Some(e)
      case _ => None
    }
}
{code}
And its usage:
{code}
val zk = CuratorFrameworkFactory.builder()
  .connectString(config.servers)
  .sessionTimeoutMs(...)
  .connectionTimeoutMs(...)
  .zookeeperFactory(new ZookeeperFactory(config.servers.split(',')))
  .retryPolicy(new RetryForever(1000))
  .build()
{code}
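As a quick sanity check, the message-based {{EmptyHostProvider}} extractor can be exercised in isolation, with no ZooKeeper dependency, since it matches purely on the exception message. This is just an illustrative sketch; the {{classify}} helper is hypothetical and only mirrors the mapping performed in the factory's catch block:
{code:title=EmptyHostProviderCheck.scala|borderStyle=solid}
object EmptyHostProviderCheck {
  // Same message constant the extractor above matches on.
  private final val Message = "A HostProvider may not be empty!"

  object EmptyHostProvider {
    def unapply(e: Throwable): Option[IllegalArgumentException] =
      e match {
        case e: IllegalArgumentException if e.getMessage == Message => Some(e)
        case _ => None
      }
  }

  // Hypothetical helper: reports how the factory's catch block would treat
  // a given exception.
  def classify(e: Throwable): String =
    e match {
      case EmptyHostProvider(_) => "map-to-ConnectionLossException"
      case _                    => "propagate"
    }

  def main(args: Array[String]): Unit = {
    assert(classify(new IllegalArgumentException(Message)) == "map-to-ConnectionLossException")
    assert(classify(new IllegalArgumentException("some other error")) == "propagate")
  }
}
{code}
Note the match is deliberately brittle: if a future ZooKeeper release rewords the message, the extractor stops matching and the exception propagates unchanged.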
> No retry on DNS lookup failure
> ------------------------------
>
> Key: CURATOR-229
> URL: https://issues.apache.org/jira/browse/CURATOR-229
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 2.7.0
> Reporter: Michael Putters
>
> Our environment is set up so that host names (rather than IP addresses) are
> used when registering services.
> When a node is disconnected from the network, it attempts to reconnect and,
> in order to do so, tries to resolve a host name, which fails (since we have
> no network connectivity and a DNS server is used).
> It appears this type of exception is not retryable, and the node simply gives
> up and never reconnects, even when network connectivity is back.
> Is this the expected behavior? Is there any way to configure Curator so that
> this type of exception is retryable? I had a look at
> {{CuratorFrameworkImpl.java}} around line 768 but there doesn't seem to be
> anything configurable.
> If this is not the expected behavior (or if it is but you don't mind making
> it configurable), I should be able to provide a patch via a pull request.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)