[hbase] Stuck replay of failed regionserver edits
-------------------------------------------------

                 Key: HADOOP-2282
                 URL: https://issues.apache.org/jira/browse/HADOOP-2282
             Project: Hadoop
          Issue Type: Bug
            Reporter: stack
            Priority: Minor


Looking at the master on a cluster of ~90 regionservers: the regionserver 
carrying the ROOT region went down (because it hadn't talked to the master in 
30 seconds).

The master notices the downed regionserver because its lease times out.  It 
then runs the server shutdown sequence, but while splitting the regionserver's 
edit log it gets stuck on the second of three log files.  Eventually, after 
~5 minutes, the second log split throws:

{code}
2007-11-26 01:21:23,999 WARN  hbase.HMaster - Processing pending operations: ProcessServerShutdown of 38.99.76.15:60020
org.apache.hadoop.dfs.AlreadyBeingCreatedException: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /hbase/hregion_-1194436719/oldlogfile.log for DFSClient_610028837 on client 38.99.77.80 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:848)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:804)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:276)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
    at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1094)
{code}

The same exception then recurs roughly every 5 minutes.

Because the regionserver that went down was carrying the ROOT region, and 
because the master is stuck in this endless retry loop, ROOT never gets 
reassigned.
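
To make the failure mode concrete, here is a minimal sketch of the retry 
pattern as I understand it from the log. It is a toy model, not the HMaster 
code: the class StuckReplaySketch, the methods splitLog and run, the local 
AlreadyBeingCreatedException stand-in, and the assumption that the second log 
(index 1) fails on every attempt are all hypothetical. It only illustrates why 
ROOT reassignment is never reached when it is ordered after a log split that 
keeps failing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the stuck shutdown sequence; NOT HBase's actual code.
class StuckReplaySketch {

    /** Local stand-in for the DFS exception seen in the master log. */
    static class AlreadyBeingCreatedException extends Exception {
        AlreadyBeingCreatedException(String msg) { super(msg); }
    }

    /** Outcome of a bounded number of retry rounds. */
    static class Result {
        final int attempts;
        final boolean rootReassigned;
        final List<String> events;

        Result(int attempts, boolean rootReassigned, List<String> events) {
            this.attempts = attempts;
            this.rootReassigned = rootReassigned;
            this.events = events;
        }
    }

    // Assumption: the second of the log files (index 1) fails on every attempt.
    static final int STUCK_LOG_INDEX = 1;

    /** Pretend to split one edit log; the stuck one always throws. */
    static void splitLog(int index) throws AlreadyBeingCreatedException {
        if (index == STUCK_LOG_INDEX) {
            throw new AlreadyBeingCreatedException(
                "failed to create file oldlogfile.log");
        }
    }

    /**
     * Run the shutdown sequence up to maxAttempts times. Each attempt restarts
     * from the first log, and ROOT is reassigned only after every log splits.
     */
    static Result run(int logCount, int maxAttempts) {
        List<String> events = new ArrayList<>();
        boolean rootReassigned = false;
        int attempts = 0;
        while (!rootReassigned && attempts < maxAttempts) {
            attempts++;
            try {
                for (int i = 0; i < logCount; i++) {
                    splitLog(i);
                    events.add("attempt " + attempts + ": split log " + i);
                }
                rootReassigned = true;
                events.add("attempt " + attempts + ": ROOT reassigned");
            } catch (AlreadyBeingCreatedException e) {
                // The operation is simply retried later; nothing clears the cause.
                events.add("attempt " + attempts + ": " + e.getMessage());
            }
        }
        return new Result(attempts, rootReassigned, events);
    }

    public static void main(String[] args) {
        Result r = run(3, 4); // three logs, give up after four rounds
        for (String e : r.events) {
            System.out.println(e);
        }
        System.out.println("ROOT reassigned: " + r.rootReassigned);
    }
}
```

In this model the only ways out are to clear whatever makes the split fail, or 
to reassign ROOT without waiting for the split to finish. Which of those is 
right for the real master is an open question here.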

