Can you instrument AssignmentManager.addToServers and see if that's
really re-adding that node?

Thanks for digging,

J-D

On Thu, Jun 23, 2011 at 7:11 PM, bijieshan <[email protected]> wrote:
> The following steps can recreate the problem:
> 1. There's thousands of regions in the cluster.
> 2. Stop the cluster.
> 3. Start the cluster. Killing one regionserver while the regions were 
> opening. Restarted it after 10 seconds.
>
> The shutted regionserver will appear in the online list again:
>
> 2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: 
> Server information: 167-6-1-12,20020,1308803390123=2220, 
> 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205, 
> 167-6-1-13,20020,1308803514394=2183
>
> Two regionservers had the same hostname but different startcode:
> 167-6-1-13,20020,1308803391742=2374
> 167-6-1-13,20020,1308803514394=2183
>
> Some other related logs:
>
> 2011-06-23 12:31:03,192 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
> for 167-6-1-13,20020,1308803391742
> 2011-06-23 12:31:06,433 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 
> 2468 region(s) that 167-6-1-13,20020,1308803391742 was carrying (skipping 0 
> regions(s) that are already in transition)
>
>
> 2011-06-23 12:31:54,687 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Registering server=167-6-1-13,20020,1308803514394, regionCount=0, 
> userLoad=false
>
> 2011-06-23 12:39:30,773 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: 
> Server information: 167-6-1-12,20020,1308803390123=2942, 
> 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2917, 
> 167-6-1-13,20020,1308803514394=749
>
>
> 2374
> 2011-06-23 12:32:41,315 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> 2d655e795ea0044885ad60d2ea057113 on 
> serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
> 2011-06-23 12:32:41,315 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> ac98d67ec85c370a39fcf8b9b2ce76d1 on 
> serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
> ...
> 2011-06-23 12:34:30,613 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> 24e833b7503a74c66f1c62c72549cb70 on 
> serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
> 2011-06-23 12:34:30,681 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> d6c4392a580527d23018576357abed3e on 
> serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
> 2011-06-23 12:34:30,771 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> 832f72e0155c6a664c1a05a2b57e8687 on 
> serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
>
>
>
> Jieshan Bean
>
>
> --------------------------------------------------
>
> Yes, it seems pretty strange.
> You're right, there's so many logs, so I just picked off part of the logs.
> I feel a hiding bug here. And I'm still working on it.
>
> I feel the function of AssignmentManager#regionOnline has a problem:
> 1. ServerShutdownHandler will remove the server from the 
> AssignmentManager.servers.
> 2. regionOnline will add the server into AssignmentManager.servers again.
>
> What about a competition happened here?
>
>  public void regionOnline(HRegionInfo regionInfo, HServerInfo serverInfo) {
>    synchronized (this.regionsInTransition) {
>        ---------------
>    }
>    synchronized (this.regions) {
>      // Add check
>      ---------------
>      addToServers(hsiWithoutLoad, regionInfo);
>      this.regions.notifyAll();
>    }
>    // Remove plan if one.
>    clearRegionPlan(regionInfo);
>    // Update timers for all regions in transition going against this server.
>    updateTimers(serverInfo);
>  }
> Thanks J-D.
>
> Jieshan Bean
>
>
>
> ------------------------------------------------
>
> As far as I can tell your server is alive because it opened the region
> just fine:
>
> Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000,
> region=612342de1fe4733f72299d70addb6d11
> Handling transition=RS_ZK_REGION_OPENING,
> server=167-6-1-11,20020,1308106004710,
> region=612342de1fe4733f72299d70addb6d11
> Handling transition=RS_ZK_REGION_OPENING,
> server=167-6-1-11,20020,1308106004710,
> region=612342de1fe4733f72299d70addb6d11
> Handling transition=RS_ZK_REGION_OPENED,
> server=167-6-1-11,20020,1308106004710,
> region=612342de1fe4733f72299d70addb6d11
>
> It has a different start code from when it was assigned (1308106004710
> vs 1308105402003) which is pretty strange, and on top of that the line
> with M_ZK_REGION_OFFLINE shows port 20000 instead of 20020.
>
> I believe that this is only a small snippet of a bigger log, right? I
> bet there is more we could learn in there, also it'd be nice if you
> told us what was happening on your cluster at the same time,
> understanding the context would help.
>
> J-D
>
> On Wed, Jun 22, 2011 at 2:31 AM, bijieshan <[email protected]> wrote:
>> 1.The Rs 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774":
>>   10:46:37,774 INFO 
>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
>> processing of shutdown of 167-6-1-11,20020,1308105402003
>>
>> 2.Overwriting happened, it seemed the RS was still exist in the set of 
>> AssignmentManager#regions:
>> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
>> Overwriting 612342de1fe4733f72299d70addb6d11 on 
>> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, 
>> usedHeap=0, maxHeap=0)
>>
>> 3.Region was assigned to this dead RS at "10:50:20,671":
>> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Assigning region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> to 167-6-1-11,20020,1308105402003
>>
>> I'm still working on this issue. I hope anyone can give out any suggestions.
>>
>> Thanks!
>>
>> Jieshan Bean
>>
>>
>> Here's the logs tracked the region of "612342de1fe4733f72299d70addb6d11":
>>
>> 10:39:51,467 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Async create of unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
>> 10:39:51,994 DEBUG 
>> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
>>  
>> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
>>  state=OFFLINE, ts=1308105591467, server=167-6-1-11,20020,1308105402003
>> 10:39:52,182 DEBUG 
>> org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
>>  
>> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
>>  state=OFFLINE, ts=1308105591467
>> 10:41:43,316 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENED, 
>> server=167-6-1-11,20020,1308105402003, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:43:22,938 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
>> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
>> 10:43:22,938 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Deleting existing unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 that is in expected state 
>> RS_ZK_REGION_OPENED
>> 10:43:23,153 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Successfully deleted unassigned node for 
>> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
>> 10:43:23,157 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> on 167-6-1-11,20020,1308105402003
>> 10:43:52,300 INFO 
>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 
>> 2994 region(s) that 167-6-1-11,20020,1308105402003 was carrying (skipping 0 
>> regions(s) that are already in transition)
>> 10:45:54,914 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Creating (or updating) unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
>> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No 
>> previous transition plan was found (or we are ignoring an existing plan) for 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> so generated a random one; 
>> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) 
>> available servers
>> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No 
>> previous transition plan was found (or we are ignoring an existing plan) for 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> so generated a random one; 
>> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) 
>> available servers
>> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Assigning region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> to 167-6-1-12,20020,1308105397174
>> 10:45:55,029 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENING, 
>> server=167-6-1-12,20020,1308105397174, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:45:55,063 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENED, 
>> server=167-6-1-12,20020,1308105397174, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:45:55,063 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
>> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
>> 10:45:55,063 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Deleting existing unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 that is in expected state 
>> RS_ZK_REGION_OPENED
>> 10:45:55,072 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Successfully deleted unassigned node for 
>> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
>> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
>> Overwriting 612342de1fe4733f72299d70addb6d11 on 
>> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, 
>> usedHeap=0, maxHeap=0)
>> 10:45:55,081 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> on 167-6-1-12,20020,1308105397174
>> 10:46:37,774 INFO 
>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
>> processing of shutdown of 167-6-1-11,20020,1308105402003
>> 10:48:45,067 INFO org.apache.hadoop.hbase.master.HMaster: balance 
>> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
>> 10:48:45,067 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Starting unassignment of region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> (offlining)
>> 10:48:45,068 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent 
>> CLOSE to serverName=167-6-1-12,20020,1308105397174, load=(requests=0, 
>> regions=0, usedHeap=0, maxHeap=0) for region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
>> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling new unassigned node: 
>> /hbase/unassigned/612342de1fe4733f72299d70addb6d11 
>> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
>> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling new unassigned node: 
>> /hbase/unassigned/612342de1fe4733f72299d70addb6d11 
>> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
>> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_CLOSED, 
>> server=167-6-1-12,20020,1308105397174, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:50:20,527 DEBUG 
>> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
>> event for 612342de1fe4733f72299d70addb6d11
>> 10:50:20,580 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing 
>> OFFLINE; 
>> was=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
>>  state=CLOSED, ts=1308106125379
>> 10:50:20,580 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Creating (or updating) unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
>> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using 
>> pre-existing plan for region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; 
>> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
>> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using 
>> pre-existing plan for region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; 
>> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
>> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Assigning region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> to 167-6-1-11,20020,1308105402003
>> 10:50:20,692 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:50:20,720 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENING, 
>> server=167-6-1-11,20020,1308106004710, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:50:20,743 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENING, 
>> server=167-6-1-11,20020,1308106004710, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:50:20,779 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
>> Handling transition=RS_ZK_REGION_OPENED, 
>> server=167-6-1-11,20020,1308106004710, 
>> region=612342de1fe4733f72299d70addb6d11
>> 10:51:33,245 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
>> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
>> 10:51:33,245 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Deleting existing unassigned node for 
>> 612342de1fe4733f72299d70addb6d11 that is in expected state 
>> RS_ZK_REGION_OPENED
>> 10:51:33,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
>> master:20000-0x330912836370000 Successfully deleted unassigned node for 
>> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
>> 10:51:33,312 DEBUG 
>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
>> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
>> on 167-6-1-11,20020,1308106004710
>>
>

Reply via email to