[ https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358830#comment-14358830 ]
Hudson commented on HADOOP-11693:
---------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #130 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/130/])
HADOOP-11693. Azure Storage FileSystem rename operations are throttled too aggressively to complete HBase WAL archiving. Contributed by Duo Xu. (cnauroth: rev 7a346bcb4fa5b56191ed00a39e72e51c9bdf1b56)
* hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/AzureNativeFileSystemStore.java
* hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterface.java
* hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/MockStorageInterface.java
* hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterfaceImpl.java
* hadoop-common-project/hadoop-common/CHANGES.txt

> Azure Storage FileSystem rename operations are throttled too aggressively to complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-11693
>          URL: https://issues.apache.org/jira/browse/HADOOP-11693
>      Project: Hadoop Common
>   Issue Type: Bug
>   Components: tools
>     Reporter: Duo Xu
>     Assignee: Duo Xu
>      Fix For: 2.7.0
>
>  Attachments: HADOOP-11681.01.patch, HADOOP-11681.02.patch, HADOOP-11681.03.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by Azure storage while HBase was archiving old WALs. HMaster aborted the region server and tried to restart it.
> However, since the cluster was still being throttled by Azure storage, the subsequent distributed log splitting also failed. Sometimes the hbase:meta table was on this region server and ended up showing as offline, which left the whole cluster in a bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 reported a fatal error:
> ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
> 	at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
> 	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> 	at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> 	at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> 	at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> 	at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> 	... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, will retry
> 	at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
> 	at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
> 	... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> 	at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> 	at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> 	at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> 	at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> 	... 11 more
> Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB implements the rename operation by copying the source blob to the destination blob and then deleting the source blob.
> Copying a blob is very costly in Azure storage, and during Azure storage GC it is highly likely to be throttled. The throttling by Azure storage usually ends within 15 minutes. The current WASB retry policy is an exponential retry, but it lasts at most about 2 minutes in total. A short-term fix is to add a more intensive exponential retry when the copy-blob operation is throttled.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
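The short-term fix described in the issue amounts to a longer-running exponential backoff wrapped around the copy-blob call. The sketch below is illustrative only, not the actual HADOOP-11693 patch: the class name, the constants (3 s base delay, 90 s per-sleep cap, 15-minute overall budget matching the typical throttling window), and the simplified "server busy" detection are all assumptions made for the example.

```java
/**
 * Hypothetical sketch of a more intensive exponential retry for throttled
 * copy-blob operations. NOT the real WASB retry policy; all names and
 * constants here are illustrative.
 */
public class ThrottledCopyRetry {
    // Start at 3 s and double on each attempt, capping each sleep at 90 s.
    static final long BASE_DELAY_MS = 3_000;
    static final long MAX_DELAY_MS = 90_000;
    // Keep retrying for up to 15 minutes, the typical throttling window
    // mentioned in the issue.
    static final long MAX_ELAPSED_MS = 15 * 60 * 1_000;

    /** Delay before retry number {@code attempt} (0-based), capped. */
    static long backoffMillis(int attempt) {
        long delay = BASE_DELAY_MS << Math.min(attempt, 30); // bound the shift
        return Math.min(delay, MAX_DELAY_MS);
    }

    /**
     * Runs {@code copy} (standing in for the startCopyFromBlob call) and
     * retries with exponential backoff until the elapsed budget runs out.
     * Real code would retry only on a "server is busy" StorageException;
     * catching RuntimeException here is a simplification.
     */
    static void copyWithRetry(Runnable copy) throws InterruptedException {
        long start = System.currentTimeMillis();
        for (int attempt = 0; ; attempt++) {
            try {
                copy.run();
                return;
            } catch (RuntimeException serverBusy) {
                if (System.currentTimeMillis() - start > MAX_ELAPSED_MS) {
                    throw serverBusy; // budget exhausted, surface the error
                }
                Thread.sleep(backoffMillis(attempt));
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 7; i++) {
            System.out.println("attempt " + i + " -> " + backoffMillis(i) + " ms");
        }
    }
}
```

With these assumed constants the sleeps run 3 s, 6 s, 12 s, 24 s, 48 s and then plateau at 90 s, so the policy keeps probing through a multi-minute throttling episode instead of giving up after roughly 2 minutes as the current policy does.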