[ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated HBASE-2707:
--------------------------------------

    Attachment: HBASE-2707.patch

Stack and I talked a lot about it, here's what we came up with. It's very hard for me to come up with a unit test since it's all deep in the master and very much time-based, but I tested the patch with TestReplication a lot and 1) it doesn't fail anymore and 2) I see in the logs that the master does the right thing.

Should I commit this?

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2707.patch
>
>
> There's an almost easy way to get stuck after a RS holding ROOT dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug. Since we don't process the delayed todo if ROOT isn't online, we'll never reassign the regions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
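
For readers following the quoted report, the stall it describes can be modeled in a few lines: a shutdown operation for the server that held ROOT fails once, lands on the delayed queue, and the delayed queue is only retried when ROOT is online, which can only happen after that same operation succeeds. The sketch below is a toy model under stated assumptions. The class `DelayedQueueSketch`, the `Operation` interface, and the `requireRootForDelayed` flag are all hypothetical names, not HBase classes, and the relaxed guard is only one conceivable way to break the cycle; it is not claimed to be what HBASE-2707.patch does.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the stall; none of these names are real HBase classes. */
public class DelayedQueueSketch {

    /** Stand-in for a queued operation such as a server-shutdown handler. */
    interface Operation {
        /** Returns true when the operation completed successfully. */
        boolean process();
    }

    private final Deque<Operation> toDo = new ArrayDeque<>();
    private final Deque<Operation> delayed = new ArrayDeque<>();
    private final boolean requireRootForDelayed;
    private boolean rootOnline = false;

    DelayedQueueSketch(boolean requireRootForDelayed) {
        this.requireRootForDelayed = requireRootForDelayed;
    }

    /** One pass of a simplified master loop. */
    void tick() {
        // Guard modeled on the report: delayed items are only retried
        // when ROOT is online (or when the guard is switched off).
        if (!delayed.isEmpty() && (rootOnline || !requireRootForDelayed)) {
            toDo.addAll(delayed);
            delayed.clear();
        }
        Operation op = toDo.poll();
        if (op != null && !op.process()) {
            delayed.add(op); // failed: park it for a later retry
        }
    }

    /** Runs the scenario and reports whether ROOT ever came back online. */
    public static boolean simulate(boolean requireRootForDelayed, int ticks) {
        DelayedQueueSketch q = new DelayedQueueSketch(requireRootForDelayed);
        final int[] attempts = {0};
        // Shutdown handling for the dead ROOT holder: the first attempt
        // fails (like the "Cannot delete" IOException in the logs); a
        // retry succeeds and brings ROOT back online.
        q.toDo.add(() -> {
            if (attempts[0]++ == 0) {
                return false;
            }
            q.rootOnline = true;
            return true;
        });
        for (int i = 0; i < ticks; i++) {
            q.tick();
        }
        return q.rootOnline;
    }

    public static void main(String[] args) {
        System.out.println("guarded:   ROOT online = " + simulate(true, 10));
        System.out.println("unguarded: ROOT online = " + simulate(false, 10));
    }
}
```

With the guard on, the failed operation is never retried and ROOT stays offline no matter how many ticks run, which matches the repeating "-ROOT- isn't online" debug lines in the report. With the guard off, the retry happens on the next tick and ROOT comes back.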