[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036965#comment-17036965
 ] 

Lokesh Jain edited comment on RATIS-815 at 2/14/20 1:31 PM:
------------------------------------------------------------

Analysis so far.
 # On avoiding checksum exception the parsing of log entry fails with below 
exception.
{code:java}
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: 
Protocol message contained an invalid tag 
(zero).org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
 Protocol message contained an invalid tag (zero). at 
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:102)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:627)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5395)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5356)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6235)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6229)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:888)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6866) at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6808) at 
org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8018)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8012)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:365)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.parseFrom(RaftProtos.java:7273) 
at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:321)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
 at 
org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
 at org.apache.ratis.tools.ParseRatisLog.dumpSegmentFile(ParseRatisLog.java:59) 
at org.apache.ratis.tools.DefaultLogDump.main(DefaultLogDump.java:40){code}

 # On avoiding both checksum check and parsing of log entry the subsequent 
entries can be parsed. But there are multiple entries in the same log segment 
file which seem to have the corruption problem.
 # Tried to debug the code as to where parsing of the entry was failing. All 
the entries seen so far are of type StateMachineLogEntryProto. The parsing 
fails after reading the logData field in StateMachineLogEntryProto. It is able 
to correctly read term and index fields.

Possible reasons of corruption which are being explored.
 # Corruption during flush of log.
 # Error related to StateMachineEntry which is part of 
StateMachineLogEntryProto. This proto contains the stateMachineData.
 # Corruption related to change in RATIS-767. We reuse a byte buffer in this 
Jira for log write.


was (Author: ljain):
Analysis so far.
 # On avoiding checksum exception the parsing of log entry fails with below 
exception.
{code:java}
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: 
Protocol message contained an invalid tag 
(zero).org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
 Protocol message contained an invalid tag (zero). at 
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:102)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:627)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5395)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5356)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6235)
 at 
org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6229)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:888)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6866) at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6808) at 
org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8018)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8012)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
 at 
org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:365)
 at 
org.apache.ratis.proto.RaftProtos$LogEntryProto.parseFrom(RaftProtos.java:7273) 
at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:321)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
 at 
org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
 at org.apache.ratis.tools.ParseRatisLog.dumpSegmentFile(ParseRatisLog.java:59) 
at org.apache.ratis.tools.DefaultLogDump.main(DefaultLogDump.java:40){code}

 # On avoiding both checksum check and parsing of log entry the subsequent 
entries can be parsed. But there are multiple entries in the same log segment 
file which seem to have the corruption problem.
 # Tried to debug the code as to where parsing of the entry was failing. All 
the entries seen so far are of type StateMachineLogEntryProto. The parsing 
fails after reading the logData field in StateMachineLogEntryProto.

Possible reasons of corruption which are being explored.
 # Corruption during flush of log.
 # Error related to StateMachineEntry which is part of 
StateMachineLogEntryProto. This proto contains the stateMachineData.
 # Corruption related to change in RATIS-767. We reuse a byte buffer in this 
Jira for log write.

> Log entry corrupted with 0 checksum
> -----------------------------------
>
>                 Key: RATIS-815
>                 URL: https://issues.apache.org/jira/browse/RATIS-815
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Attila Doroszlai
>            Priority: Blocker
>         Attachments: dumps.tar.gz, logs.tar.gz
>
>
> After writing a few large keys (128MB) with very small chunks size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>       at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>       at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>       at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>       at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>       at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>       at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>       at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>       at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>       ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
>     {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
>     {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has invalid entry (according to log 
> dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to