[ 
https://issues.apache.org/jira/browse/HBASE-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106201#comment-14106201
 ] 

Lars Hofhansl commented on HBASE-9746:
--------------------------------------

So the difference is actually in the ZK client. When you pass a ZK string with 
names that do resolve, the ZK client is fine even when nothing is running on 
those hosts (it still successfully creates the ZooKeeper object), but when 
something is passed that does not resolve at all, the ZooKeeper client fails 
to create itself.
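
To make the distinction concrete, here is a minimal sketch of the name lookup 
that separates the two cases. The quoted trace shows the failure surfacing 
from InetAddress.getByName; the HostCheck class and the resolves() helper are 
hypothetical names used only for illustration and are not part of HBase or 
the ZK client.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not HBase code: reports whether a name resolves.
class HostCheck {
    // Returns true if the name resolves to an address. Whether anything
    // is listening on that host is irrelevant at this stage.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            // Same exception type as in the quoted trace.
            return false;
        }
    }
}
```

A name that resolves passes this check even if no ZooKeeper server runs 
there, while a name that does not resolve fails before any connection is 
attempted.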

I found a way to handle this via our RecoverableZookeeper. If the actual 
ZooKeeper cannot be created, it is left null and we throw a 
KeeperException.SystemErrorException (i.e. a valid ZooKeeper exception), and 
then we lazily try to recreate it when needed. Not pretty, but it does fix 
this issue in a fairly natural way.
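
The idea can be sketched roughly as follows. This is not the actual patch: 
LazyClient, Factory, get() and isCreated() are hypothetical names, and a plain 
IOException stands in for KeeperException.SystemErrorException so the sketch 
does not depend on the ZooKeeper jar.

```java
import java.io.IOException;

// Hypothetical sketch of "leave it null, recreate lazily"; not HBase code.
class LazyClient<T> {
    // Factory that may fail, e.g. when a quorum host does not resolve.
    interface Factory<T> {
        T create() throws IOException;
    }

    private final Factory<T> factory;
    private T client; // stays null until creation succeeds

    LazyClient(Factory<T> factory) {
        this.factory = factory;
        try {
            this.client = factory.create();
        } catch (IOException e) {
            // Do not fail construction; a later get() will retry.
            this.client = null;
        }
    }

    // Returns the client, retrying creation if it failed earlier.
    // If creation fails again, the exception propagates and the
    // client remains null, so the next call retries once more.
    synchronized T get() throws IOException {
        if (client == null) {
            client = factory.create();
        }
        return client;
    }

    synchronized boolean isCreated() {
        return client != null;
    }
}
```

With this shape, constructing the wrapper never aborts the caller; only the 
operations that actually need the client see the exception, and they succeed 
once the name starts resolving.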

I have a patch. It needs some cleaning (we don't *always* want to do this, 
but only for the Replication zookeeper huh-hah), and much more testing. Will 
post later today or tomorrow.

> RegionServer can't start when replication tries to replicate to an unknown 
> host
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-9746
>                 URL: https://issues.apache.org/jira/browse/HBASE-9746
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.12
>            Reporter: Lars Hofhansl
>            Priority: Minor
>             Fix For: 0.99.0, 2.0.0, 0.98.7, 0.94.24
>
>
> Just ran into this:
> {code}
> 13/10/11 00:37:02 [regionserver60020] WARN  zookeeper.ZKConfig(204): 
> java.net.UnknownHostException: <old-host>: Name or service not known
>       at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>       at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:894)
>       at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1286)
>       at java.net.InetAddress.getAllByName0(InetAddress.java:1239)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1155)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1091)
>       at java.net.InetAddress.getByName(InetAddress.java:1041)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:201)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:245)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.reloadZkWatcher(ReplicationPeer.java:170)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.<init>(ReplicationPeer.java:69)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:343)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:308)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:189)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:156)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:89)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.newReplicationInstance(HRegionServer.java:3986)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createNewReplicationInstance(HRegionServer.java:3955)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1412)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1096)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:749)
>       at java.lang.Thread.run(Thread.java:722)
> 13/10/11 00:37:02 [regionserver60020] ERROR zookeeper.ZKConfig(210): no valid 
> quorum servers found in zoo.cfg
> 13/10/11 00:37:02 [regionserver60020] WARN  regionserver.HRegionServer(1108): 
> Exception in region server : 
> java.io.IOException: Unable to determine ZooKeeper ensemble
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:116)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:153)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.reloadZkWatcher(ReplicationPeer.java:170)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.<init>(ReplicationPeer.java:69)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:343)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:308)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:189)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:156)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:89)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.newReplicationInstance(HRegionServer.java:3986)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createNewReplicationInstance(HRegionServer.java:3955)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1412)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1096)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:749)
>       at java.lang.Thread.run(Thread.java:722)
> 13/10/11 00:37:02 [regionserver60020] INFO  regionserver.HRegionServer(1823): 
> STOPPED: Failed initialization
> 13/10/11 00:37:02 [regionserver60020] ERROR regionserver.HRegionServer(1228): 
> Failed init
> java.io.IOException: Unable to determine ZooKeeper ensemble
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:116)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:153)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.reloadZkWatcher(ReplicationPeer.java:170)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.<init>(ReplicationPeer.java:69)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:343)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:308)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:189)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:156)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:89)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.newReplicationInstance(HRegionServer.java:3986)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createNewReplicationInstance(HRegionServer.java:3955)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1412)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1096)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:749)
>       at java.lang.Thread.run(Thread.java:722)
> 13/10/11 00:37:02 [regionserver60020] FATAL regionserver.HRegionServer(1898): 
> ABORTING region server XXXXXXXX,60020,1381451821216: Unhandled exception: 
> Unable to determine ZooKeeper ensemble
> java.io.IOException: Unable to determine ZooKeeper ensemble
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:116)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:153)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.reloadZkWatcher(ReplicationPeer.java:170)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationPeer.<init>(ReplicationPeer.java:69)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:343)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:308)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:189)
>       at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:156)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:89)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.newReplicationInstance(HRegionServer.java:3986)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createNewReplicationInstance(HRegionServer.java:3955)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1412)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1096)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:749)
>       at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
