[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211179#comment-13211179 ]
chunhui shen commented on HBASE-5422:
-------------------------------------

@Stack
OK, I will submit a new patch tomorrow (I am out today).
I also think it's fine to add the new addPlan method at commit time, if there are no other problems.
Thanks.

> StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5422
>                 URL: https://issues.apache.org/jira/browse/HBASE-5422
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: chunhui shen
>         Attachments: 5422-90.patch, hbase-5422.patch
>
>
> In our production environment, we see many RIT (regions in transition) timeouts when the cluster starts up. The cluster has about 70,000 regions (25 regionservers).
> First, consider the following logs for region 33cf229845b1009aa8a3f7b0f85c9bd0.
>
> Master's log:
> 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Async create of unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state
> 2012-02-13 18:07:42,560 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409, server=r03f11025.yh.aliyun.com,60020,1329127549907
> 2012-02-13 18:07:42,996 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409
> 2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329127662996
> 2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
> 2012-02-13 18:11:16,744 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=r03f11025.yh.aliyun.com,60020,1329127549907, region=33cf229845b1009aa8a3f7b0f85c9bd0
> 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node
> 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Deleting existing unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state RS_ZK_REGION_OPENED
> 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Successfully deleted unassigned node for region 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED
> 2012-02-13 18:38:07,573 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on r03f11025.yh.aliyun.com,60020,1329127549907
> 2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so generated a random one; hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, exclude=null) available servers
> 2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to r01b05043.yh.aliyun.com,60020,1329127549041
> 2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329132528086
> 2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
>
> Regionserver's log:
> 2012-02-13 18:07:43,537 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
> 2012-02-13 18:11:16,560 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
>
> The regionserver's log shows that more than 3 minutes passed between receiving the openRegion request (18:07:43) and starting to process it (18:11:16), which caused the RIT timeout on the master for this region.
> Looking at the code of StartupBulkAssigner, we find that regionPlans are not added when regions are assigned. As a result, when one region is opened, updateTimers does not refresh the timers of the other regions whose destination is the same server.
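
To illustrate the last point of the description, here is a minimal sketch of the timer behaviour. It is a toy model, not HBase's AssignmentManager: the class RitTimerSketch, its methods, the region and server names, and the timestamps are all hypothetical. Only the 3-minute timeout and the idea that an opened region refreshes the timers of other regions planned for the same destination come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of RIT timers (hypothetical; not the real AssignmentManager). */
class RitTimerSketch {
    // 3-minute RIT timeout, as stated in the issue title.
    static final long TIMEOUT_MS = 3 * 60 * 1000L;

    // region -> destination server; a stand-in for the master's regionPlans.
    final Map<String, String> regionPlans = new HashMap<>();
    // region -> last time its in-transition timer was set or refreshed.
    final Map<String, Long> ritTimestamps = new HashMap<>();

    /** Start assigning a region; recordPlan=false models the reported behaviour. */
    void assign(String region, String server, long now, boolean recordPlan) {
        ritTimestamps.put(region, now);
        if (recordPlan) {
            regionPlans.put(region, server);
        }
    }

    /** A region opened: refresh timers of other regions planned for the same server. */
    void regionOpened(String region, String server, long now) {
        ritTimestamps.remove(region);
        regionPlans.remove(region);
        for (Map.Entry<String, String> plan : regionPlans.entrySet()) {
            boolean sameDestination = plan.getValue().equals(server);
            if (sameDestination && ritTimestamps.containsKey(plan.getKey())) {
                ritTimestamps.put(plan.getKey(), now);
            }
        }
    }

    /** True if the region is still in transition and its timer has expired. */
    boolean timedOut(String region, long now) {
        Long ts = ritTimestamps.get(region);
        return ts != null && now - ts > TIMEOUT_MS;
    }

    /** Two regions go to one server; the first opens after 2.5 min, check at 3 min 20 s. */
    static boolean secondRegionTimesOut(boolean recordPlan) {
        RitTimerSketch master = new RitTimerSketch();
        master.assign("regionA", "rs1", 0L, recordPlan);
        master.assign("regionB", "rs1", 0L, recordPlan);
        master.regionOpened("regionA", "rs1", 150_000L);
        return master.timedOut("regionB", 200_000L);
    }

    public static void main(String[] args) {
        System.out.println("without plans: " + secondRegionTimesOut(false));
        System.out.println("with plans:    " + secondRegionTimesOut(true));
    }
}
```

In this model, the second region times out only when no plan was recorded. A regionserver that is steadily working through a long open queue would therefore still trigger timeouts on the master, which matches the symptom described above.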