[ https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573880#comment-14573880 ]
Duo Xu commented on HADOOP-11693:
---------------------------------

Hi [~cnauroth],

I have submitted a new patch, which does the retries in WASB rather than relying on the Azure Storage SDK. As I looked into the source code this week, the Azure Storage SDK regards the storage exception as non-retryable, so when throttling happens the current code might still not work. Could you reopen this JIRA and review it ASAP? Thanks.

> Azure Storage FileSystem rename operations are throttled too aggressively to complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11693
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11693
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>             Fix For: 2.7.0
>
>         Attachments: HADOOP-11681.01.patch, HADOOP-11681.02.patch, HADOOP-11681.03.patch, HADOOP-11693.04.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by Azure Storage when HBase was archiving old WALs. The HMaster aborted the region server and tried to restart it.
> However, since the cluster was still being throttled by Azure Storage, the subsequent distributed log splitting also failed. Sometimes the hbase:meta table was on this region server and ended up showing as offline, which left the whole cluster in a bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 reported a fatal error:
> ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
>     at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
>     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, will retry
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
>     ... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 11 more
> Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB performs the rename by copying the source blob to the destination blob and then deleting the source blob. Copying a blob is very costly in Azure Storage, and during Azure Storage GC it is highly likely to be throttled. The throttling by Azure Storage usually ends within 15 minutes. The current WASB retry policy is exponential retry, but it lasts at most 2 minutes. A short-term fix is to add a more intensive exponential retry when the copy-blob operation is throttled.
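For illustration only, below is a minimal sketch of the kind of client-side retry described in the comment above: WASB wrapping the copy-blob step of rename() in its own intensified exponential backoff instead of relying on the Azure Storage SDK's retry policy. The class name, constants, and 15-minute time budget are hypothetical and are not taken from the attached patches; the budget simply mirrors the observation that throttling usually ends within 15 minutes.

{code}
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch (not the attached patch) of an intensified
 * exponential-backoff retry performed in WASB itself around the copy-blob
 * step of rename(), instead of relying on the Azure Storage SDK retries.
 */
public class CopyBlobRetrySketch {

  // Illustrative values: keep retrying for up to 15 minutes, since the
  // throttling described in this issue usually ends within that window.
  private static final long MAX_ELAPSED_MS = 15L * 60 * 1000;
  private static final long INITIAL_BACKOFF_MS = 3000;
  private static final long MAX_BACKOFF_MS = 90000;

  /**
   * Runs the supplied copy operation, retrying with exponential backoff while
   * the server reports "The server is busy" and the time budget is not spent.
   */
  public static <T> T withRetries(Callable<T> copyOperation) throws IOException {
    final long start = System.currentTimeMillis();
    long backoff = INITIAL_BACKOFF_MS;
    while (true) {
      try {
        return copyOperation.call();
      } catch (Exception e) {
        boolean throttled =
            e.getMessage() != null && e.getMessage().contains("The server is busy");
        long elapsed = System.currentTimeMillis() - start;
        if (!throttled || elapsed + backoff > MAX_ELAPSED_MS) {
          throw new IOException("Copy blob failed after " + elapsed + " ms", e);
        }
        try {
          Thread.sleep(backoff);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while backing off", ie);
        }
        // Double the delay on each attempt, capped to avoid unbounded sleeps.
        backoff = Math.min(backoff * 2, MAX_BACKOFF_MS);
      }
    }
  }
}
{code}

A real patch would presumably inspect the StorageException status code rather than the exception message, and make the budget and backoff intervals configurable; the message check above is only there to keep the sketch self-contained.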