[ https://issues.apache.org/jira/browse/HBASE-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982420#action_12982420 ]
stack commented on HBASE-3445:
------------------------------

James: In the AssignmentManager, where we RPC to a remote regionserver, we do the following:

{code}
    } catch (ConnectException e) {
      LOG.info("Failed connect to " + server + ", message=" + e.getMessage() +
        ", region=" + region.getEncodedName());
      // Presume that regionserver just failed and we haven't got expired
      // server from zk yet. Let expired server deal with clean up.
    } catch (java.net.SocketTimeoutException e) {
      LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
        region.getEncodedName());
      // Presume retry or server will expire.
    } catch (EOFException e) {
      LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
        region.getEncodedName());
      // Presume retry or server will expire.
    } catch (RemoteException re) {
      IOException ioe = re.unwrapRemoteException();
      if (ioe instanceof NotServingRegionException) {
        // Failed to close, so pass through and reassign
        LOG.debug("Server " + server + " returned " + ioe + " for " +
          region.getEncodedName());
      } else if (ioe instanceof EOFException) {
        // Failed to close, so pass through and reassign
        LOG.debug("Server " + server + " returned " + ioe + " for " +
          region.getEncodedName());
      } else {
        this.master.abort("Remote unexpected exception", ioe);
      }
    } catch (Throwable t) {
{code}

I think you are right to add the timeout to the try/catch in getCachedConnection. Maybe we should catch ConnectException there too? Unless you object, I'll add it when I commit your patch.
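To make the proposal above concrete, here is a minimal, self-contained sketch of the idea: treat a connect timeout or a refused connection as "no connection available" and let the caller wait, rather than letting the exception propagate and abort the master. This is not the CatalogTracker code or the attached patch; `Connector`, `RegionServerConnection`, and the two-argument `getCachedConnection` signature are hypothetical stand-ins invented for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class CachedConnectionSketch {

    /** Hypothetical stand-in for the RPC proxy to a regionserver. */
    interface RegionServerConnection {}

    /** Hypothetical stand-in for whatever actually opens the connection. */
    interface Connector {
        RegionServerConnection connect(String address) throws IOException;
    }

    /**
     * Returns a connection, or null if the server at the recorded address
     * cannot be reached. Other IOExceptions still propagate to the caller.
     */
    static RegionServerConnection getCachedConnection(Connector connector, String address)
            throws IOException {
        try {
            return connector.connect(address);
        } catch (SocketTimeoutException e) {
            // Assumption: the recorded location is stale (e.g. data copied from
            // another host), so report "no connection" instead of failing.
            System.out.println("Timed out connecting to " + address + ": " + e.getMessage());
            return null;
        } catch (ConnectException e) {
            // Assumption: the server is down or not yet started; same handling.
            System.out.println("Failed connect to " + address + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the stale address from the stack trace timing out.
        Connector stale = addr -> {
            throw new SocketTimeoutException("20000 millis timeout");
        };
        RegionServerConnection c = getCachedConnection(stale, "192.168.1.102:60020");
        System.out.println("connection is null: " + (c == null));
    }
}
```

With this shape, a caller that verifies the meta location would see null and could keep waiting for a regionserver to check in with a fresh location.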

> Master crashes on data that was moved from different host
> ---------------------------------------------------------
>
>                 Key: HBASE-3445
>                 URL: https://issues.apache.org/jira/browse/HBASE-3445
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: James Kennedy
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 3445_0.90.0.patch
>
>
> While testing an upgrade to 0.90.0 RC3 I noticed that if I seeded our test
> data on one machine and transferred it to another machine, the HMaster on the
> new machine dies on startup.
> Based on the following stack trace, it looks as though it is attempting to
> find the .meta region at the IP address of the original machine. Instead
> of waiting around for RegionServers to register with new location data,
> HMaster throws its hands up with a FATAL exception.
> Note that deleting the zookeeper dir makes no difference.
> Also note that so far I have only reproduced this in my own environment using
> the hbase-trx extension of HBase and an ApplicationStarter that starts the
> Master and RegionServer together in the same JVM. While the issue seems
> likely to be independent of those factors, it is far from a vanilla HBase
> environment.
> I will spend some time trying to reproduce the issue in a proper hbase test.
> But perhaps someone can beat me to it? How do I simulate the IP switch? May
> require a data.tar upload.
> [14/01/11 10:45:20] 6396 [ Thread-298] ERROR server.quorum.QuorumPeerConfig - Invalid configuration, only one server specified (ignoring)
> [14/01/11 10:45:21] 7178 [ main] INFO ion.service.HBaseRegionService - troove> region port: 60010
> [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> region interface: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
> [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> root dir: hdfs://localhost:8701/hbase
> [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> Initializing region server.
> [14/01/11 10:45:21] 7631 [ main] INFO ion.service.HBaseRegionService - troove> Starting region server thread.
> [14/01/11 10:46:54] 100764 [ HMaster] FATAL he.hadoop.hbase.master.HMaster - Unhandled exception. Starting shutdown.
> java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.102/192.168.1.102:60020]
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:311)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:865)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258)
>     at $Proxy14.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:283)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:478)
>     at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.