Can you instrument AssignmentManager.addToServers and see if that's really re-adding that node?
Thanks for digging, J-D On Thu, Jun 23, 2011 at 7:11 PM, bijieshan <[email protected]> wrote: > The following steps can recreate the problem: > 1. There's thousands of regions in the cluster. > 2. Stop the cluster. > 3. Start the cluster. Killing one regionserver while the regions were > opening. Restarted it after 10 seconds. > > The shutted regionserver will appear in the online list again: > > 2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: > Server information: 167-6-1-12,20020,1308803390123=2220, > 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205, > 167-6-1-13,20020,1308803514394=2183 > > Two regionservers had the same hostname but different startcode: > 167-6-1-13,20020,1308803391742=2374 > 167-6-1-13,20020,1308803514394=2183 > > Some other related logs: > > 2011-06-23 12:31:03,192 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs > for 167-6-1-13,20020,1308803391742 > 2011-06-23 12:31:06,433 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning > 2468 region(s) that 167-6-1-13,20020,1308803391742 was carrying (skipping 0 > regions(s) that are already in transition) > > > 2011-06-23 12:31:54,687 INFO org.apache.hadoop.hbase.master.ServerManager: > Registering server=167-6-1-13,20020,1308803514394, regionCount=0, > userLoad=false > > 2011-06-23 12:39:30,773 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: > Server information: 167-6-1-12,20020,1308803390123=2942, > 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2917, > 167-6-1-13,20020,1308803514394=749 > > > 2374 > 2011-06-23 12:32:41,315 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > 2d655e795ea0044885ad60d2ea057113 on > serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) > 2011-06-23 12:32:41,315 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > ac98d67ec85c370a39fcf8b9b2ce76d1 on > serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) > ... > 2011-06-23 12:34:30,613 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > 24e833b7503a74c66f1c62c72549cb70 on > serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) > 2011-06-23 12:34:30,681 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > d6c4392a580527d23018576357abed3e on > serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) > 2011-06-23 12:34:30,771 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > 832f72e0155c6a664c1a05a2b57e8687 on > serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) > > > > Jieshan Bean > > > -------------------------------------------------- > > Yes, it seems pretty strange. > You're right, there's so many logs, so I just picked off part of the logs. > I feel a hiding bug here. And I'm still working on it. > > I feel the function of AssignmentManager#regionOnline has a problem: > 1. ServerShutdownHandler will remove the server from the > AssignmentManager.servers. > 2. regionOnline will add the server into AssignmentManager.servers again. > > What about a competition happened here? > > public void regionOnline(HRegionInfo regionInfo, HServerInfo serverInfo) { > synchronized (this.regionsInTransition) { > --------------- > } > synchronized (this.regions) { > // Add check > --------------- > addToServers(hsiWithoutLoad, regionInfo); > this.regions.notifyAll(); > } > // Remove plan if one. > clearRegionPlan(regionInfo); > // Update timers for all regions in transition going against this server. > updateTimers(serverInfo); > } > Thanks J-D. > > Jieshan Bean > > > > ------------------------------------------------ > > As far as I can tell your server is alive because it opened the region > just fine: > > Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000, > region=612342de1fe4733f72299d70addb6d11 > Handling transition=RS_ZK_REGION_OPENING, > server=167-6-1-11,20020,1308106004710, > region=612342de1fe4733f72299d70addb6d11 > Handling transition=RS_ZK_REGION_OPENING, > server=167-6-1-11,20020,1308106004710, > region=612342de1fe4733f72299d70addb6d11 > Handling transition=RS_ZK_REGION_OPENED, > server=167-6-1-11,20020,1308106004710, > region=612342de1fe4733f72299d70addb6d11 > > It has a different start code from when it was assigned (1308106004710 > vs 1308105402003) which is pretty strange, and on top of that the line > with M_ZK_REGION_OFFLINE shows port 20000 instead of 20020. > > I believe that this is only a small snippet of a bigger log, right? I > bet there is more we could learn in there, also it'd be nice if you > told us what was happening on your cluster at the same time, > understanding the context would help. > > J-D > > On Wed, Jun 22, 2011 at 2:31 AM, bijieshan <[email protected]> wrote: >> 1.The Rs 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774": >> 10:46:37,774 INFO >> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished >> processing of shutdown of 167-6-1-11,20020,1308105402003 >> >> 2.Overwriting happened, it seemed the RS was still exist in the set of >> AssignmentManager#regions: >> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: >> Overwriting 612342de1fe4733f72299d70addb6d11 on >> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, >> usedHeap=0, maxHeap=0) >> >> 3.Region was assigned to this dead RS at "10:50:20,671": >> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Assigning region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> to 167-6-1-11,20020,1308105402003 >> >> I'm still working on this issue. I hope anyone can give out any suggestions. >> >> Thanks! >> >> Jieshan Bean >> >> >> Here's the logs tracked the region of "612342de1fe4733f72299d70addb6d11": >> >> 10:39:51,467 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Async create of unassigned node for >> 612342de1fe4733f72299d70addb6d11 with OFFLINE state >> 10:39:51,994 DEBUG >> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: >> >> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> state=OFFLINE, ts=1308105591467, server=167-6-1-11,20020,1308105402003 >> 10:39:52,182 DEBUG >> org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: >> >> rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> state=OFFLINE, ts=1308105591467 >> 10:41:43,316 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENED, >> server=167-6-1-11,20020,1308105402003, >> region=612342de1fe4733f72299d70addb6d11 >> 10:43:22,938 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED >> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node >> 10:43:22,938 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Deleting existing unassigned node for >> 612342de1fe4733f72299d70addb6d11 that is in expected state >> RS_ZK_REGION_OPENED >> 10:43:23,153 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Successfully deleted unassigned node for >> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED >> 10:43:23,157 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> on 167-6-1-11,20020,1308105402003 >> 10:43:52,300 INFO >> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning >> 2994 region(s) that 167-6-1-11,20020,1308105402003 was carrying (skipping 0 >> regions(s) that are already in transition) >> 10:45:54,914 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Creating (or updating) unassigned node for >> 612342de1fe4733f72299d70addb6d11 with OFFLINE state >> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No >> previous transition plan was found (or we are ignoring an existing plan) for >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> so generated a random one; >> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) >> available servers >> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No >> previous transition plan was found (or we are ignoring an existing plan) for >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> so generated a random one; >> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) >> available servers >> 10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Assigning region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> to 167-6-1-12,20020,1308105397174 >> 10:45:55,029 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENING, >> server=167-6-1-12,20020,1308105397174, >> region=612342de1fe4733f72299d70addb6d11 >> 10:45:55,063 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENED, >> server=167-6-1-12,20020,1308105397174, >> region=612342de1fe4733f72299d70addb6d11 >> 10:45:55,063 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED >> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node >> 10:45:55,063 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Deleting existing unassigned node for >> 612342de1fe4733f72299d70addb6d11 that is in expected state >> RS_ZK_REGION_OPENED >> 10:45:55,072 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Successfully deleted unassigned node for >> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED >> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: >> Overwriting 612342de1fe4733f72299d70addb6d11 on >> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, >> usedHeap=0, maxHeap=0) >> 10:45:55,081 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> on 167-6-1-12,20020,1308105397174 >> 10:46:37,774 INFO >> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished >> processing of shutdown of 167-6-1-11,20020,1308105402003 >> 10:48:45,067 INFO org.apache.hadoop.hbase.master.HMaster: balance >> hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003 >> 10:48:45,067 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Starting unassignment of region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> (offlining) >> 10:48:45,068 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent >> CLOSE to serverName=167-6-1-12,20020,1308105397174, load=(requests=0, >> regions=0, usedHeap=0, maxHeap=0) for region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling new unassigned node: >> /hbase/unassigned/612342de1fe4733f72299d70addb6d11 >> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED) >> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling new unassigned node: >> /hbase/unassigned/612342de1fe4733f72299d70addb6d11 >> (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED) >> 10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_CLOSED, >> server=167-6-1-12,20020,1308105397174, >> region=612342de1fe4733f72299d70addb6d11 >> 10:50:20,527 DEBUG >> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED >> event for 612342de1fe4733f72299d70addb6d11 >> 10:50:20,580 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing >> OFFLINE; >> was=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> state=CLOSED, ts=1308106125379 >> 10:50:20,580 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Creating (or updating) unassigned node for >> 612342de1fe4733f72299d70addb6d11 with OFFLINE state >> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using >> pre-existing plan for region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; >> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003 >> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using >> pre-existing plan for region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; >> plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., >> src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003 >> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Assigning region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> to 167-6-1-11,20020,1308105402003 >> 10:50:20,692 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000, >> region=612342de1fe4733f72299d70addb6d11 >> 10:50:20,720 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENING, >> server=167-6-1-11,20020,1308106004710, >> region=612342de1fe4733f72299d70addb6d11 >> 10:50:20,743 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENING, >> server=167-6-1-11,20020,1308106004710, >> region=612342de1fe4733f72299d70addb6d11 >> 10:50:20,779 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: >> Handling transition=RS_ZK_REGION_OPENED, >> server=167-6-1-11,20020,1308106004710, >> region=612342de1fe4733f72299d70addb6d11 >> 10:51:33,245 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED >> event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node >> 10:51:33,245 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Deleting existing unassigned node for >> 612342de1fe4733f72299d70addb6d11 that is in expected state >> RS_ZK_REGION_OPENED >> 10:51:33,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: >> master:20000-0x330912836370000 Successfully deleted unassigned node for >> region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED >> 10:51:33,312 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region >> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. >> on 167-6-1-11,20020,1308106004710 >> >
