The following steps can recreate the problem:
1. There's thousands of regions in the cluster.
2. Stop the cluster.
3. Start the cluster. Killing one regionserver while the regions were opening. 
Restarted it after 10 seconds.

The shutted regionserver will appear in the online list again:

2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: 
Server information: 167-6-1-12,20020,1308803390123=2220, 
167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205, 
167-6-1-13,20020,1308803514394=2183

Two regionservers had the same hostname but different startcode:
167-6-1-13,20020,1308803391742=2374
167-6-1-13,20020,1308803514394=2183

Some other related logs:

2011-06-23 12:31:03,192 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
for 167-6-1-13,20020,1308803391742
2011-06-23 12:31:06,433 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 2468 
region(s) that 167-6-1-13,20020,1308803391742 was carrying (skipping 0 
regions(s) that are already in transition)


2011-06-23 12:31:54,687 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=167-6-1-13,20020,1308803514394, regionCount=0, userLoad=false

2011-06-23 12:39:30,773 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: 
Server information: 167-6-1-12,20020,1308803390123=2942, 
167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2917, 
167-6-1-13,20020,1308803514394=749


2374
2011-06-23 12:32:41,315 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 2d655e795ea0044885ad60d2ea057113 on 
serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
2011-06-23 12:32:41,315 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting ac98d67ec85c370a39fcf8b9b2ce76d1 on 
serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
...
2011-06-23 12:34:30,613 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 24e833b7503a74c66f1c62c72549cb70 on 
serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
2011-06-23 12:34:30,681 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting d6c4392a580527d23018576357abed3e on 
serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
2011-06-23 12:34:30,771 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 832f72e0155c6a664c1a05a2b57e8687 on 
serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)



Jieshan Bean


--------------------------------------------------

Yes, it seems pretty strange.
You're right, there's so many logs, so I just picked off part of the logs.
I feel a hiding bug here. And I'm still working on it.

I feel the function of AssignmentManager#regionOnline has a problem:
1. ServerShutdownHandler will remove the server from the 
AssignmentManager.servers.
2. regionOnline will add the server into AssignmentManager.servers again.

What about a competition happened here?

  public void regionOnline(HRegionInfo regionInfo, HServerInfo serverInfo) {  
    synchronized (this.regionsInTransition) {
        ---------------
    }
    synchronized (this.regions) {
      // Add check
      ---------------
      addToServers(hsiWithoutLoad, regionInfo);
      this.regions.notifyAll();
    }
    // Remove plan if one.
    clearRegionPlan(regionInfo);
    // Update timers for all regions in transition going against this server.
    updateTimers(serverInfo);
  }
Thanks J-D.

Jieshan Bean



------------------------------------------------

As far as I can tell your server is alive because it opened the region
just fine:

Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENING,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENING,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENED,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11

It has a different start code from when it was assigned (1308106004710
vs 1308105402003) which is pretty strange, and on top of that the line
with M_ZK_REGION_OFFLINE shows port 20000 instead of 20020.

I believe that this is only a small snippet of a bigger log, right? I
bet there is more we could learn in there, also it'd be nice if you
told us what was happening on your cluster at the same time,
understanding the context would help.

J-D

On Wed, Jun 22, 2011 at 2:31 AM, bijieshan <[email protected]> wrote:
> 1.The Rs 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774":
>   10:46:37,774 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of 167-6-1-11,20020,1308105402003
>
> 2.Overwriting happened, it seemed the RS was still exist in the set of 
> AssignmentManager#regions:
> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
> Overwriting 612342de1fe4733f72299d70addb6d11 on 
> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
>
> 3.Region was assigned to this dead RS at "10:50:20,671":
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
> Assigning region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 
> 167-6-1-11,20020,1308105402003
>
> I'm still working on this issue. I hope anyone can give out any suggestions.
>
> Thanks!
>
> Jieshan Bean
>
>
> Here's the logs tracked the region of "612342de1fe4733f72299d70addb6d11":
>
> 10:39:51,467 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Async create of unassigned node for 
> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
> 10:39:51,994 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
>  
> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
> state=OFFLINE, ts=1308105591467, server=167-6-1-11,20020,1308105402003
> 10:39:52,182 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
>  
> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
> state=OFFLINE, ts=1308105591467
> 10:41:43,316 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENED, server=167-6-1-11,20020,1308105402003, 
> region=612342de1fe4733f72299d70addb6d11
> 10:43:22,938 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
> 10:43:22,938 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Deleting existing unassigned node for 
> 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
> 10:43:23,153 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Successfully deleted unassigned node for 
> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
> 10:43:23,157 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 
> 167-6-1-11,20020,1308105402003
> 10:43:52,300 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 
> 2994 region(s) that 167-6-1-11,20020,1308105402003 was carrying (skipping 0 
> regions(s) that are already in transition)
> 10:45:54,914 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Creating (or updating) unassigned node for 
> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No 
> previous transition plan was found (or we are ignoring an existing plan) for 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. so 
> generated a random one; 
> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) 
> available servers
> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No 
> previous transition plan was found (or we are ignoring an existing plan) for 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. so 
> generated a random one; 
> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) 
> available servers
> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
> Assigning region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 
> 167-6-1-12,20020,1308105397174
> 10:45:55,029 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENING, server=167-6-1-12,20020,1308105397174, 
> region=612342de1fe4733f72299d70addb6d11
> 10:45:55,063 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENED, server=167-6-1-12,20020,1308105397174, 
> region=612342de1fe4733f72299d70addb6d11
> 10:45:55,063 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
> 10:45:55,063 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Deleting existing unassigned node for 
> 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
> 10:45:55,072 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Successfully deleted unassigned node for 
> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
> Overwriting 612342de1fe4733f72299d70addb6d11 on 
> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0)
> 10:45:55,081 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 
> 167-6-1-12,20020,1308105397174
> 10:46:37,774 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of 167-6-1-11,20020,1308105402003
> 10:48:45,067 INFO org.apache.hadoop.hbase.master.HMaster: balance 
> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
> 10:48:45,067 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting 
> unassignment of region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. 
> (offlining)
> 10:48:45,068 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent 
> CLOSE to serverName=167-6-1-12,20020,1308105397174, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0) for region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> new unassigned node: /hbase/unassigned/612342de1fe4733f72299d70addb6d11 
> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> new unassigned node: /hbase/unassigned/612342de1fe4733f72299d70addb6d11 
> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=167-6-1-12,20020,1308105397174, 
> region=612342de1fe4733f72299d70addb6d11
> 10:50:20,527 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 612342de1fe4733f72299d70addb6d11
> 10:50:20,580 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing 
> OFFLINE; 
> was=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
>  state=CLOSED, ts=1308106125379
> 10:50:20,580 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Creating (or updating) unassigned node for 
> 612342de1fe4733f72299d70addb6d11 with OFFLINE state
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using 
> pre-existing plan for region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; 
> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using 
> pre-existing plan for region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; 
> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.,
>  src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
> Assigning region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 
> 167-6-1-11,20020,1308105402003
> 10:50:20,692 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000, 
> region=612342de1fe4733f72299d70addb6d11
> 10:50:20,720 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENING, server=167-6-1-11,20020,1308106004710, 
> region=612342de1fe4733f72299d70addb6d11
> 10:50:20,743 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENING, server=167-6-1-11,20020,1308106004710, 
> region=612342de1fe4733f72299d70addb6d11
> 10:50:20,779 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENED, server=167-6-1-11,20020,1308106004710, 
> region=612342de1fe4733f72299d70addb6d11
> 10:51:33,245 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
> 10:51:33,245 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Deleting existing unassigned node for 
> 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
> 10:51:33,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x330912836370000 Successfully deleted unassigned node for 
> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
> 10:51:33,312 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 
> 167-6-1-11,20020,1308106004710
>

Reply via email to