[ 
https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573880#comment-14573880
 ] 

Duo Xu commented on HADOOP-11693:
---------------------------------

Hi [~cnauroth]

I have submitted a new patch, which does the retries in WASB rather than rely 
on Azure Storage SDK. As I looked into the source code this week, Azure Storage 
SDK regards storage exception as non-retryable, so when throttling happens, the 
current code might still not work.

Could you reopen this JIRA and review it ASAP?

Thanks.

> Azure Storage FileSystem rename operations are throttled too aggressively to 
> complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11693
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11693
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>             Fix For: 2.7.0
>
>         Attachments: HADOOP-11681.01.patch, HADOOP-11681.02.patch, 
> HADOOP-11681.03.patch, HADOOP-11693.04.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by 
> Azure storage, when HBase was archiving old WALs. HMaster aborted the region 
> server and tried to restart it.
> However, since the cluster was still being throttled by Azure storage, the 
> upcoming distributed log splitting also failed. Sometimes hbase:meta table 
> was on this region server and finally showed offline, which cause the whole 
> cluster in bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
> server 
> workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 
> reported a fatal error:
> ABORTING region server 
> workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE 
> in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: 
> com.microsoft.windowsazure.storage.StorageException: The server is busy.
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
>       at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
>       at 
> org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
>       at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is 
> busy.
>       at 
> com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>       at 
> com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>       at 
> com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>       at 
> com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>       at 
> org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>       ... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
> Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for 
> workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, 
> will retry
>       at 
> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: 
> com.microsoft.windowsazure.storage.StorageException: The server is busy.
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>       at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
>       at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
>       at 
> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
>       ... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is 
> busy.
>       at 
> com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>       at 
> com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>       at 
> com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>       at 
> com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>       at 
> org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>       ... 11 more
> Sun Mar 01 18:59:51 GMT 2015, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, 
> org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is 
> not online on 
> workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>       at 
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>       at 
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>       at 
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB will do rename operation by copying src blob to 
> destination blob and deleting the src blob. Copy blob is very costly in Azure 
> storage and during Azure storage gc, it will be highly likely throttled. The 
> throttling by Azure storage usually ends within 15mins. Current WASB retry 
> policy is exponential retry, but only last at most for 2min. Short term fix 
> will be adding a more intensive exponential retry when copy blob is throttled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to