[ 
https://issues.apache.org/jira/browse/HBASE-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982420#action_12982420
 ] 

stack commented on HBASE-3445:
------------------------------

James:

In the AssignmentManager, where we RPC to a remote regionserver, we do the 
following:

{code}
    } catch (ConnectException e) {
      LOG.info("Failed connect to " + server + ", message=" + e.getMessage() +
        ", region=" + region.getEncodedName());
      // Presume that regionserver just failed and we haven't got expired
      // server from zk yet.  Let expired server deal with clean up.
    } catch (java.net.SocketTimeoutException e) {
      LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
        region.getEncodedName());
      // Presume retry or server will expire.
    } catch (EOFException e) {
      LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
        region.getEncodedName());
      // Presume retry or server will expire.
    } catch (RemoteException re) {
      IOException ioe = re.unwrapRemoteException();
      if (ioe instanceof NotServingRegionException) {
        // Failed to close, so pass through and reassign
        LOG.debug("Server " + server + " returned " + ioe + " for " +
          region.getEncodedName());
      } else if (ioe instanceof EOFException) {
        // Failed to close, so pass through and reassign
        LOG.debug("Server " + server + " returned " + ioe + " for " +
          region.getEncodedName());
      } else {
        this.master.abort("Remote unexpected exception", ioe);
      }
    } catch (Throwable t) {
{code}

I think you are right to add the timeout to the try/catch in 
getCachedConnection.  Maybe we should catch ConnectException there too? Unless 
you object, I'll add it when I commit your patch.
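
For illustration, here is a rough sketch of the shape I have in mind.  It is 
not the patch itself: the class name, the Connector interface, and the choice 
to return null for "server not reachable right now" are all made up for the 
example, so treat them as assumptions rather than the CatalogTracker code.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical sketch only; names here do not come from the HBase source.
class CachedConnectionSketch {

  /** Hypothetical stand-in for whatever opens the RPC proxy to a regionserver. */
  interface Connector<T> {
    T connect(String address) throws IOException;
  }

  /**
   * Tries to connect to the given address.
   * Returns null (an assumption of this sketch) when the server cannot be
   * reached, so the caller can wait or reassign instead of aborting.
   * Any other IOException is still propagated to the caller.
   */
  static <T> T getCachedConnection(Connector<T> connector, String address)
      throws IOException {
    try {
      return connector.connect(address);
    } catch (SocketTimeoutException | ConnectException e) {
      // The recorded location may be stale (for example, data copied from
      // another host), or the server may have just died. Report
      // "no connection" rather than letting the exception kill the caller.
      return null;
    }
  }
}
```

With something like this, a stale address such as the one in the stack trace 
below would produce a null connection instead of an unhandled exception in 
the master.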

> Master crashes on data that was moved from different host
> ---------------------------------------------------------
>
>                 Key: HBASE-3445
>                 URL: https://issues.apache.org/jira/browse/HBASE-3445
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: James Kennedy
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 3445_0.90.0.patch
>
>
> While testing an upgrade to 0.90.0 RC3 I noticed that if I seeded our test 
> data on one machine and transferred to another machine the HMaster on the new 
> machine dies on startup.
> Based on the following stack trace it looks as though it is attempting to 
> find the .meta region with the IP address of the original machine.  Instead 
> of waiting around for RegionServers to register with new location data, 
> HMaster throws its hands up with a FATAL exception.
> Note that deleting the zookeeper dir makes no difference.
> Also note that so far I have only reproduced this in my own environment using 
> the hbase-trx extension of HBase and an ApplicationStarter that starts the 
> Master and RegionServer together in the same JVM.  While the issue seems 
> likely isolated from those factors it is far from a vanilla HBase environment.
> I will spend some time trying to reproduce the issue in a proper hbase test.  
> But perhaps someone can beat me to it?  How do I simulate the IP switch? May 
> require a data.tar upload. 
> [14/01/11 10:45:20] 6396   [     Thread-298] ERROR 
> server.quorum.QuorumPeerConfig  - Invalid configuration, only one server 
> specified (ignoring)
> [14/01/11 10:45:21] 7178   [           main] INFO  
> ion.service.HBaseRegionService  - troove> region port:       60010
> [14/01/11 10:45:21] 7180   [           main] INFO  
> ion.service.HBaseRegionService  - troove> region interface:  
> org.apache.hadoop.hbase.ipc.IndexedRegionInterface
> [14/01/11 10:45:21] 7180   [           main] INFO  
> ion.service.HBaseRegionService  - troove> root dir: 
> hdfs://localhost:8701/hbase
> [14/01/11 10:45:21] 7180   [           main] INFO  
> ion.service.HBaseRegionService  - troove> Initializing region server.
> [14/01/11 10:45:21] 7631   [           main] INFO  
> ion.service.HBaseRegionService  - troove> Starting region server thread.
> [14/01/11 10:46:54] 100764 [        HMaster] FATAL 
> he.hadoop.hbase.master.HMaster  - Unhandled exception. Starting shutdown.
> java.net.SocketTimeoutException: 20000 millis timeout while waiting for 
> channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending 
> remote=192.168.1.102/192.168.1.102:60020]
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:311)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:865)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258)
>       at $Proxy14.getProtocolVersion(Unknown Source)
>       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
>       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
>       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
>       at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
>       at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
>       at 
> org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
>       at 
> org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:283)
>       at 
> org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:478)
>       at 
> org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
>       at 
> org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
>       at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
