Duo Xu created HADOOP-11681:
-------------------------------

             Summary: HBase gives up retries after being throttled by Azure storage
                 Key: HADOOP-11681
                 URL: https://issues.apache.org/jira/browse/HADOOP-11681
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools
            Reporter: Duo Xu
            Assignee: Duo Xu


One of our customers' production HBase clusters was periodically throttled by Azure storage while HBase was archiving old WALs. HMaster aborted the affected region server and tried to restart it.

However, since the cluster was still being throttled by Azure storage, the subsequent distributed log splitting also failed. Sometimes the hbase:meta table was hosted on this region server and ended up showing offline, which left the whole cluster in a bad state.

{code}
2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 reported a fatal error:
ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE in log roller
Cause:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
        at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
        at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
        at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
        at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
        at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
        at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
        at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
        at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
        at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
        at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
        ... 8 more

2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, will retry
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
        at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
        at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
        at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
        at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
        ... 4 more
Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
        at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
        at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
        at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
        at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
        at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
        at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
        ... 11 more

Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:745)
{code}

When archiving old WALs, WASB implements the rename operation by copying the source blob to the destination blob and then deleting the source blob. Copying a blob is very costly in Azure storage, and during Azure storage GC it is highly likely to be throttled. The throttling usually ends within 15 minutes. The current WASB retry policy is an exponential retry, but it lasts at most about 2 minutes before giving up. A short-term fix is to add a more intensive exponential retry when the copy-blob operation is throttled, as sketched below.
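To make the intent of that short-term fix concrete, here is a minimal, hypothetical sketch of a copy-blob retry wrapper with a longer exponential backoff, sized so that retries span roughly the 15 minutes a throttling episode typically lasts instead of giving up after about 2 minutes. The class and method names, the backoff constants, and the message-based throttling check are illustrative assumptions, not the actual WASB code.

{code}
import java.util.concurrent.Callable;

/**
 * Illustrative sketch only: retries a throttled operation (e.g. the copy-blob
 * step of a WASB rename) with exponential backoff for up to ~15 minutes.
 * Names and constants are hypothetical, not the real WASB implementation.
 */
public final class ThrottlingRetrySketch {

  private static final long INITIAL_BACKOFF_MS = 3_000L;          // first wait: 3s
  private static final long MAX_BACKOFF_MS     = 90_000L;         // cap each wait at 90s
  private static final long MAX_TOTAL_MS       = 15L * 60L * 1000L; // give up after ~15min

  /** Runs the operation, backing off exponentially while the server reports "busy". */
  public static <T> T copyWithThrottlingRetry(Callable<T> operation) throws Exception {
    long backoff = INITIAL_BACKOFF_MS;
    long deadline = System.currentTimeMillis() + MAX_TOTAL_MS;
    while (true) {
      try {
        return operation.call();
      } catch (Exception e) {
        // The real fix would inspect the StorageException / HTTP 503 status;
        // matching the "server is busy" message here is purely illustrative.
        boolean throttled = e.getMessage() != null
            && e.getMessage().toLowerCase().contains("server is busy");
        if (!throttled || System.currentTimeMillis() + backoff > deadline) {
          throw e;  // not a throttling error, or the retry budget is exhausted
        }
        Thread.sleep(backoff);
        backoff = Math.min(backoff * 2, MAX_BACKOFF_MS);
      }
    }
  }
}
{code}

In the rename path such a wrapper would surround only the copy-blob step; an equivalent effect could presumably also be obtained by configuring a longer-running retry policy on the storage request itself rather than hand-rolling the loop.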



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
