[ 
https://issues.apache.org/jira/browse/HBASE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667384#action_12667384
 ] 

Jim Kellerman commented on HBASE-1123:
--------------------------------------

On the hbase-0.19 branch, I could not reproduce this. I killed the server
holding root while the cluster was under load, and it exited the waiting state
in about a minute (1:02 min:secs from lease expiry to removal from the dead
list):

{code}
2009-01-26 19:26:32,266 INFO org.apache.hadoop.hbase.master.RegionManager: 
assigning region -ROOT-,,0 to server 208.76.44.141:8020

2009-01-26 19:41:08,396 INFO org.apache.hadoop.hbase.master.ServerManager: 
208.76.44.141:8020 lease expired
2009-01-26 19:42:10,757 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Removed 
208.76.44.141:8020 from deadservers Map
{code}

I then waited for the cluster to rebalance, put it under load again, and killed
the server holding the root region. This time it took a little longer
(2 min 18 sec after the lease expired) before the server was removed from the
dead list.

{code}
2009-01-26 19:41:11,808 INFO org.apache.hadoop.hbase.master.RegionManager: 
assigning region -ROOT-,,0 to server 208.76.44.139:8020

2009-01-26 19:49:01,966 INFO org.apache.hadoop.hbase.master.ServerManager: 
208.76.44.139:8020 lease expired
2009-01-26 19:51:20,354 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Removed 
208.76.44.139:8020 from deadservers Map
{code}
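
The intervals can be read straight off the log timestamps in the two excerpts above. A minimal sketch of that arithmetic, assuming the `yyyy-MM-dd HH:mm:ss,SSS` timestamp layout seen in the excerpts; the class and method names (`DeadListTiming`, `between`) are hypothetical and not part of HBase:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for checking the intervals; not part of HBase.
final class DeadListTiming {
    // Timestamp layout as it appears in the master log excerpts.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Elapsed time between two log timestamps. */
    static Duration between(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TS),
                                LocalDateTime.parse(to, LOG_TS));
    }

    public static void main(String[] args) {
        // "lease expired" -> "Removed ... from deadservers Map", first run.
        Duration first = between("2009-01-26 19:41:08,396", "2009-01-26 19:42:10,757");
        // Same pair of events, second run.
        Duration second = between("2009-01-26 19:49:01,966", "2009-01-26 19:51:20,354");
        System.out.println("first run:  " + first.getSeconds() + " s");
        System.out.println("second run: " + second.getSeconds() + " s");
    }
}
```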

However, if leases included the start code, we could have put the restarted
server back into service much sooner: the new instance would carry a different
start code, so it would not interfere with the splitting of the old instance's
logs (which include the start code in their name).
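
To make the start-code idea concrete, here is a minimal sketch, not the actual master code. All names in it (`DeadServerSketch`, `ServerInstance`, `mayReportForDuty`, and so on) are hypothetical, the start codes are made up, and the `log_<host>_<startcode>_<port>` naming is inferred from the log paths quoted in the issue description. It contrasts a dead-list check keyed only on host:port with one that also compares the start code:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch: these classes are illustrative and are not HBase APIs.
final class DeadServerSketch {

    /** One incarnation of a region server: address plus that process's start code. */
    static final class ServerInstance {
        final String host;
        final int port;
        final long startCode;

        ServerInstance(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        String address() {
            return host + ":" + port;
        }

        /** Log directory name, assuming the log_<host>_<startcode>_<port> pattern. */
        String logDirName() {
            return "log_" + host + "_" + startCode + "_" + port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ServerInstance)) return false;
            ServerInstance other = (ServerInstance) o;
            return host.equals(other.host) && port == other.port
                    && startCode == other.startCode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }
    }

    private final Set<ServerInstance> dead = new HashSet<>();

    /** Called when a server's lease expires; its logs still need splitting. */
    void leaseExpired(ServerInstance s) {
        dead.add(s);
    }

    /** Called once the dead instance's logs have been split. */
    void logsSplit(ServerInstance s) {
        dead.remove(s);
    }

    /** Address-only check: a restart on the same host:port must wait. */
    boolean mayReportForDutyByAddress(ServerInstance s) {
        for (ServerInstance d : dead) {
            if (d.address().equals(s.address())) return false;
        }
        return true;
    }

    /** Start-code-aware check: only the exact dead incarnation is blocked. */
    boolean mayReportForDuty(ServerInstance s) {
        return !dead.contains(s);
    }

    public static void main(String[] args) {
        DeadServerSketch master = new DeadServerSketch();
        // Same address, different (made-up) start codes.
        ServerInstance crashed = new ServerInstance("208.76.44.141", 8020, 1000L);
        ServerInstance restarted = new ServerInstance("208.76.44.141", 8020, 2000L);

        master.leaseExpired(crashed);
        System.out.println("by address:      " + master.mayReportForDutyByAddress(restarted));
        System.out.println("by start code:   " + master.mayReportForDuty(restarted));
        System.out.println("log dirs differ: "
                + !crashed.logDirName().equals(restarted.logDirName()));
    }
}
```

Because the two incarnations map to different log directory names under this assumption, the restarted server would write to a fresh directory while the old one is still being split.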



> Server never leaves the dead list though logs have all been processed if 
> crashed server had -ROOT- (seemingly)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1123
>                 URL: https://issues.apache.org/jira/browse/HBASE-1123
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>             Fix For: 0.20.0
>
>         Attachments: 1123.patch
>
>
> Cluster is just hung after host that had -ROOT- completed splitting its 
> logs... old server is just stuck on the dead list and never comes off it.
> {code}
> ..
> 2009-01-13 01:09:36,448 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 6: 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020/hlog.dat.1231718928939
> 2009-01-13 01:09:37,396 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:38,591 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for 
> path 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/712889985/oldlogfile.log
>  and region TestTable,0040922294,1231559109829
> 2009-01-13 01:09:38,670 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for 
> path 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/484208094/oldlogfile.log
>  and region TestTable,0042007133,1231628296909
> 2009-01-13 01:09:45,096 [HMaster] INFO 
> org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020
> 2009-01-13 01:09:47,317 [SocketListener0-2] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location serverXX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:47,416 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:47,518 [IPC Server handler 3 on 60000] INFO 
> org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to 
> server XX.XX.XX141:60020
> 2009-01-13 01:09:49,007 [IPC Server handler 6 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 430, Num Servers: 
> 3, Avg Load: 144.0
> 2009-01-13 01:09:50,219 [SocketListener0-0] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO 
> org.apache.hadoop.hbase.master.ServerManager: Received 
> MSG_REPORT_PROCESS_OPEN: -ROOT-,,0 from XX.XX.XX.141:60020
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO 
> org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: 
> -ROOT-,,0 from 208.76.44.141:60020
> 2009-01-13 01:09:50,719 [SocketListener0-3] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:50,967 [SocketListener0-4] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location serverXX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:52,117 [SocketListener0-5] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> ....
> 2009-01-13 01:09:57,426 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> ....
> 2009-01-13 01:10:45,156 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.HMaster: Processing todo: 
> ProcessServerShutdown of XX.XX.XX142:60020
> 2009-01-13 01:10:45,156 [HMaster] INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of 
> server XX.XX.XX.142:60020: logSplit: true, rootRescanned: false, 
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-01-13 01:10:45,156 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process 
> server shutdown scanning root region on XX.XX.XX.141
> 2009-01-13 01:10:45,182 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.RegionServerOperation: process server shutdown 
> scanning root region on XX.XX.XX.141 finished HMaster
> 2009-01-13 01:10:45,183 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanMetaRegions: process 
> server shutdown scanning .META.,,1 on XX.XX.XX.142:60020
> 2009-01-13 01:10:47,496 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:10:49,320 [IPC Server handler 8 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 431, Num Servers: 
> 3, Avg Load: 144.0
> .....
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
