[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423535#comment-13423535 ]

Elliott Clark commented on HBASE-6461:
--------------------------------------

bq. Any chance to repeat this with distributed log splitting disabled?
Sure, give me a little bit. JD and I are reading the logs from my latest example. Once we're done (assuming we don't solve it) I'll try with distributed log splitting off.
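For that run the only change should be the distributed-splitting switch itself; a minimal sketch of what "off" means, assuming the hbase.master.distributed.log.splitting key (in practice it goes in hbase-site.xml on the master rather than in code):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DisableDistributedSplitting {
  public static void main(String[] args) {
    // Hypothetical sketch: with this key set to false the master splits HLogs
    // itself instead of handing split tasks to region servers via ZooKeeper.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.master.distributed.log.splitting", false);
    System.out.println("distributed splitting enabled? "
        + conf.getBoolean("hbase.master.distributed.log.splitting", true));
  }
}
{noformat}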


On the latest run, JD noticed this in the logs of the server that was doing the splitting (there was only one HLog; nothing had rolled yet). 10.4.3.44 is the server that had all of its java processes killed.
{noformat}
2012-07-26 22:13:18,079 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 might be still open, length is 0
2012-07-26 22:13:18,081 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:19,083 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:20,091 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 0 time(s).
2012-07-26 22:13:21,092 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 1 time(s).
2012-07-26 22:13:22,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 2 time(s).
2012-07-26 22:13:23,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 3 time(s).
2012-07-26 22:13:24,094 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 4 time(s).
2012-07-26 22:13:25,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 5 time(s).
2012-07-26 22:13:26,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 6 time(s).
2012-07-26 22:13:27,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 7 time(s).
2012-07-26 22:13:27,936 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=50.28 MB, free=5.94 GB, max=5.99 GB, blocks=0, accesses=0, hits=0, hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
2012-07-26 22:13:28,097 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 8 time(s).
2012-07-26 22:13:29,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 9 time(s).
2012-07-26 22:13:29,100 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 for reading. File is empty
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1521)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
    at java.lang.Thread.run(Thread.java:662)
{noformat}

So the log splitter keeps retrying the dead server over and over, then fails to read anything from the HLog, which is why ROOT ends up empty.
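
The EOFException itself is just what you get when a zero-length file is opened as a SequenceFile: the reader does a readFully() for the SEQ magic/version header and there are no bytes to satisfy it. A standalone sketch of that failure mode, plain java.io with no HBase involved (file name made up):
{noformat}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class EmptyWalRead {
  public static void main(String[] args) throws IOException {
    // A zero-length file, like the un-recovered HLog in the log excerpt above.
    File wal = File.createTempFile("empty-hlog", ".tmp");
    wal.deleteOnExit();

    try (DataInputStream in = new DataInputStream(new FileInputStream(wal))) {
      byte[] header = new byte[4]; // SequenceFile starts by reading "SEQ" + version
      in.readFully(header);        // same readFully as in the stack trace above
    } catch (EOFException e) {
      // This is the java.io.EOFException that HLogSplitter reports as
      // "Could not open ... for reading. File is empty".
      System.out.println("EOFException on zero-length file, as expected");
    }
  }
}
{noformat}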
                
> Killing the HRegionServer and DataNode hosting ROOT can result in a malformed 
> root table.
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-6461
>                 URL: https://issues.apache.org/jira/browse/HBASE-6461
>             Project: HBase
>          Issue Type: Bug
>         Environment: hadoop-0.20.2-cdh3u3
> HBase 0.94.1 RC1
>            Reporter: Elliott Clark
>            Priority: Critical
>             Fix For: 0.94.2
>
>
> Spun up a new dfs on hadoop-0.20.2-cdh3u3
> Started hbase
> started running loadtest tool.
> killed rs and dn holding root with killall -9 java on server sv4r27s44 at 
> about 2012-07-25 22:40:00
> After things stabilize Root is in a bad state. Ran hbck and got:
> Exception in thread "main" org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in -ROOT- for region .META.,,1.1028785192 containing row
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1016)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:172)
> at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:241)
> at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3236)
> hbase(main):001:0> scan '-ROOT-'
> ROW          COLUMN+CELL
> 12/07/25 22:43:18 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>  .META.,,1   column=info:regioninfo, timestamp=1343255838525, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
>  .META.,,1   column=info:v, timestamp=1343255838525, value=\x00\x00
> 1 row(s) in 0.5930 seconds
> Here's the master log: https://gist.github.com/3179194
> I tried the same thing with 0.92.1 and I was able to get into a similar 
> situation, so I don't think this is anything new. 
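
Side note on verifying the bad state: what hbck is reporting is literally a missing info:server cell in the .META.,,1 row of -ROOT-, which is also visible in the scan above. A hedged sketch of checking that directly against the 0.94 client API (the class name is mine; row key .META.,,1 as shown in the scan):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckRootServerAddress {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // -ROOT- holds a single row for .META.,,1; the region's location lives in
    // the info:server / info:serverstartcode columns of that row.
    HTable root = new HTable(conf, HConstants.ROOT_TABLE_NAME);
    try {
      Result r = root.get(new Get(Bytes.toBytes(".META.,,1")));
      byte[] server = r.getValue(HConstants.CATALOG_FAMILY, HConstants.SERVER_QUALIFIER);
      // In the broken state this prints null, which is exactly the
      // "No server address listed in -ROOT-" that hbck trips over.
      System.out.println("info:server for .META.,,1 = "
          + (server == null ? "null (missing)" : Bytes.toString(server)));
    } finally {
      root.close();
    }
  }
}
{noformat}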
