[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423535#comment-13423535 ]
Elliott Clark commented on HBASE-6461:
--------------------------------------

bq. Any chance to repeat this with distributed log splitting disabled?

Sure, give me a little bit. I'm reading the logs from my latest run. Once JD and I are done (assuming we don't solve it) I'll try with distributed log splitting off.

On the latest run, JD noticed this in the logs of the server that was doing the splitting (there was only one HLog; nothing had rolled yet). 10.4.3.44 is the server that had all of its Java processes killed.

{noformat}
2012-07-26 22:13:18,079 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 might be still open, length is 0
2012-07-26 22:13:18,081 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:19,083 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:20,091 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 0 time(s).
2012-07-26 22:13:21,092 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 1 time(s).
2012-07-26 22:13:22,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 2 time(s).
2012-07-26 22:13:23,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 3 time(s).
2012-07-26 22:13:24,094 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 4 time(s).
2012-07-26 22:13:25,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 5 time(s).
2012-07-26 22:13:26,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 6 time(s).
2012-07-26 22:13:27,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 7 time(s).
2012-07-26 22:13:27,936 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=50.28 MB, free=5.94 GB, max=5.99 GB, blocks=0, accesses=0, hits=0, hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
2012-07-26 22:13:28,097 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 8 time(s).
2012-07-26 22:13:29,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 9 time(s).
2012-07-26 22:13:29,100 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 for reading. File is empty
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1521)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
	at java.lang.Thread.run(Thread.java:662)
{noformat}

So the log splitter keeps retrying the dead server, then fails to read anything, and -ROOT- ends up empty.

> Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-6461
>                 URL: https://issues.apache.org/jira/browse/HBASE-6461
>             Project: HBase
>          Issue Type: Bug
>         Environment: hadoop-0.20.2-cdh3u3
>                      HBase 0.94.1 RC1
>            Reporter: Elliott Clark
>            Priority: Critical
>             Fix For: 0.94.2
>
>
> Spun up a new DFS on hadoop-0.20.2-cdh3u3.
> Started HBase.
> Started running the loadtest tool.
> Killed the RS and DN holding ROOT with killall -9 java on server sv4r27s44 at about 2012-07-25 22:40:00.
> After things stabilized, ROOT was in a bad state. Running hbck gave:
>
> Exception in thread "main" org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in -ROOT- for region .META.,,1.1028785192 containing row
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1016)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
> 	at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
> 	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:172)
> 	at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:241)
> 	at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3236)
>
> hbase(main):001:0> scan '-ROOT-'
> ROW        COLUMN+CELL
> 12/07/25 22:43:18 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> .META.,,1  column=info:regioninfo, timestamp=1343255838525, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
> .META.,,1  column=info:v, timestamp=1343255838525, value=\x00\x00
> 1 row(s) in 0.5930 seconds
>
> Here's the master log: https://gist.github.com/3179194
> I tried the same thing with 0.92.1 and I was able to get into a similar situation, so I don't think this is anything new.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
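As a footnote on the EOFException in the splitter log above: it is what DataInputStream.readFully throws when asked to read a header from a zero-length file, which SequenceFile.Reader.init does for the file's leading magic bytes (a SequenceFile begins with the bytes "SEQ"). A minimal JDK-only sketch, not HBase code, reproduces the "File is empty" case:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EmptyWalRead {
    // Mimics the failing step in SequenceFile.Reader.init(): readFully() on the
    // file's 3-byte magic throws EOFException when the file has zero bytes,
    // which HLogSplitter then reports as "File is empty". Illustrative sketch only.
    static String tryReadMagic(byte[] fileBytes) {
        byte[] magic = new byte[3]; // a SequenceFile starts with the bytes "SEQ"
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            in.readFully(magic); // throws EOFException if fewer than 3 bytes exist
            return new String(magic, StandardCharsets.US_ASCII);
        } catch (EOFException e) {
            return "empty-or-truncated";
        } catch (IOException e) {
            return "io-error";
        }
    }

    public static void main(String[] args) {
        // The zero-length WAL left behind by the killed server:
        System.out.println(tryReadMagic(new byte[0]));                              // empty-or-truncated
        // A file that at least has the SequenceFile magic reads fine:
        System.out.println(tryReadMagic("SEQ".getBytes(StandardCharsets.US_ASCII))); // SEQ
    }
}
```

This is why the splitter only hits the EOFException after lease recovery and the connect retries are exhausted: until then it cannot tell whether the 0-length file is truly empty or just still open on the dead datanode.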