[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Prakash Khemani (JIRA) Thu, 21 Apr 2011 15:42:46 -0700

    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022964#comment-13022964 ]


Prakash Khemani commented on HBASE-3674:
----------------------------------------

The patch sets hbase.hlog.split.skip.errors to true by default. Why was the 
ChecksumException not simply ignored, as originally proposed?

This patch is in trunk. In the serialized log splitting case, 
hbase.hlog.split.skip.errors defaults to true, but in the distributed log 
splitting case it defaults to false.
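
To make the mismatch concrete, here is a minimal sketch of two code paths reading the same key with different fallback defaults. It uses java.util.Properties as a stand-in for the Hadoop Configuration object, and the class and method names (SkipErrorsDefaults, serializedSkipErrors, distributedSkipErrors) are hypothetical, not the actual HBase code; only the key name comes from the patch.

```java
import java.util.Properties;

public class SkipErrorsDefaults {
    static final String KEY = "hbase.hlog.split.skip.errors";

    // Stand-in for a getBoolean(key, default) style lookup:
    // returns the default only when the key is absent.
    static boolean getBoolean(Properties conf, String key, boolean dflt) {
        String v = conf.getProperty(key);
        return v == null ? dflt : Boolean.parseBoolean(v.trim());
    }

    // Hypothetical serialized-splitting path: falls back to true.
    static boolean serializedSkipErrors(Properties conf) {
        return getBoolean(conf, KEY, true);
    }

    // Hypothetical distributed-splitting path: falls back to false.
    static boolean distributedSkipErrors(Properties conf) {
        return getBoolean(conf, KEY, false);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Key unset: the two paths disagree.
        System.out.println("serialized=" + serializedSkipErrors(conf)
            + " distributed=" + distributedSkipErrors(conf));
        // Key set explicitly: both paths agree.
        conf.setProperty(KEY, "true");
        System.out.println("serialized=" + serializedSkipErrors(conf)
            + " distributed=" + distributedSkipErrors(conf));
    }
}
```

If the real code looks anything like this, setting the key explicitly in the site configuration would make both paths behave the same regardless of their built-in defaults.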

> Treat ChecksumException as we would a ParseException splitting logs; else we 
> replay split on every restart
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3674
>                 URL: https://issues.apache.org/jira/browse/HBASE-3674
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.2
>
>         Attachments: 3674-v2.txt, 3674.txt
>
>
> In short, a ChecksumException will fail log processing for a server, so we 
> skip out without archiving logs.  On restart, we'll then reprocess the logs -- 
> usually hitting the ChecksumException anew -- and so on.
> Here is the splitLog method (edited):
> {code}
>   private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
>     ....
>     outputSink.startWriterThreads(entryBuffers);
>     
>     try {
>       int i = 0;
>       for (FileStatus log : logfiles) {
>         Path logPath = log.getPath();
>         long logLength = log.getLen();
>         splitSize += logLength;
>         LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length
>             + ": " + logPath + ", length=" + logLength);
>         try {
>           recoverFileLease(fs, logPath, conf);
>           parseHLog(log, entryBuffers, fs, conf);
>           processedLogs.add(logPath);
>         } catch (EOFException eof) {
>           // truncated files are expected if a RS crashes (see HBASE-2643)
>           LOG.info("EOF from hlog " + logPath + ". Continuing");
>           processedLogs.add(logPath);
>         } catch (FileNotFoundException fnfe) {
>           // A file may be missing if the region server was able to archive it
>           // before shutting down. This means the edits were persisted already
>           LOG.info("A log was missing " + logPath +
>               ", probably because it was moved by the" +
>               " now dead region server. Continuing");
>           processedLogs.add(logPath);
>         } catch (IOException e) {
>           // If the IOE resulted from bad file format,
>           // then this problem is idempotent and retrying won't help
>           if (e.getCause() instanceof ParseException ||
>               e.getCause() instanceof ChecksumException) {
>             LOG.warn("ParseException from hlog " + logPath + ".  continuing");
>             processedLogs.add(logPath);
>           } else {
>             if (skipErrors) {
>               LOG.info("Got while parsing hlog " + logPath +
>                 ". Marking as corrupted", e);
>               corruptedLogs.add(logPath);
>             } else {
>               throw e;
>             }
>           }
>         }
>       }
>       if (fs.listStatus(srcDir).length > processedLogs.size()
>           + corruptedLogs.size()) {
>         throw new OrphanHLogAfterSplitException(
>             "Discovered orphan hlog after split. Maybe the "
>             + "HRegionServer was not dead when we started");
>       }
>       archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf); 
>      
>     } finally {
>       splits = outputSink.finishWritingAndClose();
>     }
>     return splits;
>   }
> {code}
> Notice how we'll archive logs only if we successfully split all of them.  
> We won't archive any of the first 31 of 35 files if we happen to get a 
> checksum exception on file 32.
> I think we should treat a ChecksumException the same as a ParseException; a 
> retry will not fix it if HDFS could not get around the ChecksumException 
> (it seems that in our case all replicas were corrupt).
> Here is a play-by-play from the logs:
> {code}
> 813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=15065662
> 813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> 813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: 
> Found checksum error: b[0, 
> 512]=000000cd000000502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f7374
>        
> 6174735f6275636b65740000000d9fa4d5dc0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564
>        
> 696e5f73686172655f70656e64696e675f696e69740000012ec956b02804000000000000000100000000ffffffff4e128eca0eb078d0652b0abac467fd09000000cd000000502034663166613763666165333930666332653138346233393931303132623366331d6170695f6163
>        
> 636573735f746f6b656e5f73746174735f6275636b65740000000d9fa4d5dd0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002366303734323966643036323862636530336238333938356239316237386633353a32390764656661756c746170695f61
>        
> 6e64726f69645f6c6f67676564696e5f73686172655f70656e64696e675f696e69740000012ec9569f1804000000000000000100000000000000d30000004e2066663763393964303633343339666531666461633761616632613964643631331b6170695f6163636573735f746f
>        6b656e5f73746174735f68
> 813618 org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_7781725413191608261:of:/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481 at 15064576
> 813619         at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
> 813620         at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
> 813621         at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
> 813622         at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
> 813623         at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
> 813624         at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1175)
> 813625         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1807)
> 813626         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1859)
> 813627         at java.io.DataInputStream.read(DataInputStream.java:132)
> 813628         at java.io.DataInputStream.readFully(DataInputStream.java:178)
> 813629         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
> 813630         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> 813631         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
> 813632         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
> 813633         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
> 813634         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
> 813635         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
> 813636         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
> 813637         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
> 813638         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
> 813639         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
> 813640         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
> 813641         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
> 813642         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
> 813643 2011-03-18 20:31:46,239 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_7781725413191608261_14589573 from 10.20.20.182:50010 at 15064576
> 813644 2011-03-18 20:31:46,240 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_7781725413191608261_14589573 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 813645 2011-03-18 20:31:49,243 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=80624 entries from hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> {code}
> See code above.  On exception we'll dump the edits read so far from this 
> block and close out all writers, tying off the recovered.edits written so 
> far.  We'll skip archiving these files because we only archive if all files 
> are processed; we won't archive the first 30 of 35 files if we fail 
> splitting on file 31.
> I think a ChecksumException should be treated the same as a ParseException.
>   
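
For reference, the per-file decision in the quoted splitLog can be boiled down to a small classification function. The following is a rough sketch only: SplitErrorPolicy, Outcome and classify are hypothetical names, and the nested ChecksumException is a local stand-in for org.apache.hadoop.fs.ChecksumException so that the example compiles without Hadoop on the classpath. It mirrors the quoted code in checking e.getCause(); whether the real exception always arrives wrapped that way is an assumption.

```java
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.text.ParseException;

public class SplitErrorPolicy {
    /** Local stand-in for org.apache.hadoop.fs.ChecksumException (hypothetical). */
    public static class ChecksumException extends IOException {
        public ChecksumException(String msg) { super(msg); }
    }

    /** What the splitter does with a log file after an exception. */
    public enum Outcome { PROCESSED, CORRUPTED, RETHROW }

    /** Mirrors the catch blocks of the quoted splitLog method. */
    public static Outcome classify(IOException e, boolean skipErrors) {
        // Truncated or already-archived files are expected; move on.
        if (e instanceof EOFException || e instanceof FileNotFoundException) {
            return Outcome.PROCESSED;
        }
        // A bad file format or a bad checksum will not be fixed by a retry.
        Throwable cause = e.getCause();
        if (cause instanceof ParseException || cause instanceof ChecksumException) {
            return Outcome.PROCESSED;
        }
        // Anything else depends on hbase.hlog.split.skip.errors.
        return skipErrors ? Outcome.CORRUPTED : Outcome.RETHROW;
    }

    public static void main(String[] args) {
        IOException wrapped =
            new IOException("read failed", new ChecksumException("Checksum error"));
        System.out.println(classify(wrapped, false));
        System.out.println(classify(new IOException("other"), false));
        System.out.println(classify(new IOException("other"), true));
    }
}
```

Only the RETHROW outcome aborts the split before archiveLogs runs, which is the case that leads to the replay on every restart.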

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
