[ https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931174#comment-16931174 ]
Lokesh Jain commented on RATIS-677:
-----------------------------------

[~szetszwo] Thanks for working on this! If we ignore the exception while reading a segment file, wouldn't that make the log segments inconsistent? If the segment with the error, or any segment after it, is used later, it might lead to unpredictable results.

> Logentry marked corrupt due to ChecksumException
> ------------------------------------------------
>
>                 Key: RATIS-677
>                 URL: https://issues.apache.org/jira/browse/RATIS-677
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Sammi Chen
>            Assignee: Tsz Wo Nicholas Sze
>            Priority: Blocker
>         Attachments: r677_20190913.patch
>
>
> Steps:
> 1. Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2. Stopped the datanodes through ./stop-ozone.sh.
> 3. Changed the Ozone binaries.
> 4. Started the cluster through ./start-ozone.sh.
> 5. Two datanodes registered with SCM. Two datanodes failed to appear on the SCM side.
>
> I checked the two failed nodes; the datanode processes are still running. In the
> log file, I found many of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR - Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated checksum is -134141393 but read checksum 0
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
>     at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>     at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
>     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
>     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:110)
>     at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)

--
This message was sent by Atlassian Jira
(v8.3.2#803003)