[ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055119#comment-13055119 ]
Ted Yu commented on HBASE-4031:
-------------------------------

Registration of the 3 RS (including 158-1-101-82:20020) wasn't in HMaster222.log
I noticed:
{code}
2011-05-24 10:59:01,213 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to 158-1-101-222,20020,1306205940117
2011-05-24 10:59:11,067 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: RemoteException connecting to RS
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not running yet
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1038)
...
2011-05-24 10:59:11,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for hello,200040,1305944346902.2ce947cccbfe15b7210dd21d0cc2c515. so generated a random one; hri=hello,200040,1305944346902.2ce947cccbfe15b7210dd21d0cc2c515., src=, dest=158-1-101-82,20020,1306205415714; 4 (online=4, exclude=serverName=158-1-101-222,20020,1306205940117, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers
{code}
There is only 1 line from LoadBalancer in the master log.
If this scenario can be reproduced, please add more DEBUG logging before line 251 in balanceCluster().
I guess some regions were counted twice on 158-1-101-222,20020.

> An imbalance result calculated by LoadBalancer
> ----------------------------------------------
>
>                 Key: HBASE-4031
>                 URL: https://issues.apache.org/jira/browse/HBASE-4031
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HMaster222.rar, HRegionServer222.rar
>
>
> I found the problem while the cluster couldn't balance (around
> 2011-05-24 11:28). One node's region count is double that of the other nodes.
> And it didn't move regions anymore:
> Address                Start Code       Load
> 158-1-101-202:20030    1306205409671    requests=0, regions=2593, usedHeap=114, maxHeap=8165
> 158-1-101-222:20030    1306205940117    requests=0, regions=5841, usedHeap=80, maxHeap=8165
> 158-1-101-52:20030     1306205417261    requests=0, regions=2622, usedHeap=76, maxHeap=8165
> 158-1-101-82:20030     1306205415714    requests=0, regions=2633, usedHeap=69, maxHeap=8165
> Total: servers: 4                       requests=0, regions=13689
> HBASE-3985 ("Same Region could be picked out twice in LoadBalancer") was
> found during my analysis of this problem.
> But I'm afraid it's not the main cause of the problem.
> There's one active master, one standby master, and four regionservers in our
> cluster.
> >>10:57:41, the standby HMaster 222 becomes the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master
> startup proceeding: master failover
> >>The 4 regionservers were registered in 222 one by one. Only one regionserver
> >>registered noticeably later than the others.
> 2011-05-24 10:57:37,533 INFO : Registering
> server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering
> server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering
> server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering
> server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
> >>13134 regions needed to be moved after rebuildUserRegions (13689 regions in
> >>the cluster at that time).
> 2011-05-24 10:58:47,534 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to
> process 13134 regions in transition
> >>All 13134 regions were opened. Regions opened on each server:
> 158-1-101-222,20020,1306205940117    Count: 834
> 158-1-101-82,20020,1306205415714     Count: 4093
> 158-1-101-202,20020,1306205409671    Count: 4118
> 158-1-101-52,20020,1306205417261     Count: 4089
> >>The nearest balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer:
> Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded
> servers onto 1 less loaded servers
> "5012" is an implausible number here, because it is larger than the average
> of "3424.5" regions per server.
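
The reporter's argument about "5012" can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and are not part of HBase, and it does not reproduce the LoadBalancer code. It assumes that a balanced plan never needs to push more than ceil(average) regions onto any single destination server, and it uses the 13689-region total from the status table, which gives an average of 3422.25 (close to the 3424.5 quoted in the report).

```java
// Hypothetical helper, not HBase code: checks whether a balancer plan's
// move count is plausible given the cluster size.
final class BalanceSanityCheck {
    private BalanceSanityCheck() {}

    /** Ceiling of the average number of regions per server. */
    static long ceilAverage(long totalRegions, int servers) {
        return (totalRegions + servers - 1) / servers;
    }

    /**
     * Assumption: even an empty destination server needs at most
     * ceil(average) regions, so a plan that moves more than
     * destinationServers * ceil(average) regions cannot be balanced.
     */
    static boolean isPlausible(long plannedMoves, int destinationServers,
                               long totalRegions, int servers) {
        long bound = (long) destinationServers * ceilAverage(totalRegions, servers);
        return plannedMoves <= bound;
    }

    public static void main(String[] args) {
        // Region counts from the status table in the issue description.
        long total = 2593 + 5841 + 2622 + 2633;
        int servers = 4;
        System.out.println("total regions   = " + total);
        System.out.println("ceil(average)   = " + ceilAverage(total, servers));
        // The logged plan: 5012 regions onto 1 less loaded server.
        System.out.println("5012 plausible? = " + isPlausible(5012, 1, total, servers));
    }
}
```

Under these assumptions the logged plan fails the check, while a plan that sheds only the excess of the most loaded server in the table (5841 - 3423 = 2418 regions) would pass it.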