[ 
https://issues.apache.org/jira/browse/RATIS-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1891:
------------------------------
    Component/s: RaftLog
    Description: 
(Original Summary) Gap between logs cause service startup failure

This is the second raft gap problem reported by Guo Hao.

{code:java}
2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start 
failed with exception
java.util.concurrent.CompletionException: java.lang.IllegalStateException: gap 
between start index 375 and first entry to append 377
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at 
java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
        at 
java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at 
org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
        at 
org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: gap between start index 375 and 
first entry to append 377
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
        at 
org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
        at 
org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
        at 
org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
        at 
org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
        at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
        at 
org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
        at 
org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
        at 
org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
        ... 4 more 
{code}


# The gap server directory 

{code:java}

$ ll 
/home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
total 4120
-rw-rw-r-- 1 work work   14567 Sep  8 18:30 log_291-374
-rw-rw-r-- 1 work work 4194304 Sep  8 18:30 log_inprogress_375
-rw-rw-r-- 1 work work      50 Sep  8 18:30 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 17:29 raft-meta.conf 

{code}

 

The other two


{code:java}
$ ll
total 4168
-rw-rw-r-- 1 work work      95 Sep  8 12:13 log_0-0
-rw-rw-r-- 1 work work   39285 Sep  8 17:30 log_1-290
-rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
-rw-rw-r-- 1 work work     271 Sep  8 17:50 log_375-376
-rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
-rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 

{code}


{code:java}
$ ll
total 4168
-rw-rw-r-- 1 work work      95 Sep  8 13:15 log_0-0
-rw-rw-r-- 1 work work   39285 Sep  8 17:29 log_1-290
-rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
-rw-rw-r-- 1 work work     271 Sep  8 18:29 log_375-376
-rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
-rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 

{code}

 

Related Configurations:


{code:java}
<property>
<name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
<value>false</value>
</property>
<property>
<name>hdds.ratis.raft.server.log.async-flush.enabled</name>
<value>false</value>
</property> 
{code}


 

The scene in which the GAP occurs this time is as follows:
1. shutdown scm, shutdown more than 60s timeout kill -9
2. restart scm, this error occurs


  was:
This is the second raft gap problem reported by Guo Hao.




{code:java}
2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start 
failed with exception
java.util.concurrent.CompletionException: java.lang.IllegalStateException: gap 
between start index 375 and first entry to append 377
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at 
java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
        at 
java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at 
org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
        at 
org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: gap between start index 375 and 
first entry to append 377
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
        at 
org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
        at 
org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
        at 
org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
        at 
org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
        at 
org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
        at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
        at 
org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
        at 
org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
        at 
org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
        ... 4 more 
{code}


# The gap server directory 

{code:java}

$ ll 
/home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
total 4120
-rw-rw-r-- 1 work work   14567 Sep  8 18:30 log_291-374
-rw-rw-r-- 1 work work 4194304 Sep  8 18:30 log_inprogress_375
-rw-rw-r-- 1 work work      50 Sep  8 18:30 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 17:29 raft-meta.conf 

{code}

 

The other two


{code:java}
$ ll
total 4168
-rw-rw-r-- 1 work work      95 Sep  8 12:13 log_0-0
-rw-rw-r-- 1 work work   39285 Sep  8 17:30 log_1-290
-rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
-rw-rw-r-- 1 work work     271 Sep  8 17:50 log_375-376
-rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
-rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 

{code}


{code:java}
$ ll
total 4168
-rw-rw-r-- 1 work work      95 Sep  8 13:15 log_0-0
-rw-rw-r-- 1 work work   39285 Sep  8 17:29 log_1-290
-rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
-rw-rw-r-- 1 work work     271 Sep  8 18:29 log_375-376
-rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
-rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
-rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 

{code}

 

Related Configurations:


{code:java}
<property>
<name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
<value>false</value>
</property>
<property>
<name>hdds.ratis.raft.server.log.async-flush.enabled</name>
<value>false</value>
</property> 
{code}


 

The scene in which the GAP occurs this time is as follows:
1. shutdown scm, shutdown more than 60s timeout kill -9
2. restart scm, this error occurs


        Summary: A tool to fix gaps in RaftLog  (was: Gap between logs cause 
service startup failure)

[~Nicholas Niu], [~srnjl], [~Sammi], I wonder if you still keep seeing gaps in 
RaftLog?

Anyway, a tool fixing the gaps definitely is a good idea.  Revising the Summary 
...

> A tool to fix gaps in RaftLog
> -----------------------------
>
>                 Key: RATIS-1891
>                 URL: https://issues.apache.org/jira/browse/RATIS-1891
>             Project: Ratis
>          Issue Type: Bug
>          Components: RaftLog
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Critical
>         Attachments: log_291-374, log_375-376, log_inprogress_375, 
> log_inprogress_377
>
>
> (Original Summary) Gap between logs cause service startup failure
> This is the second raft gap problem reported by Guo Hao.
> {code:java}
> 2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start 
> failed with exception
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> gap between start index 375 and first entry to append 377
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at 
> java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
>         at 
> java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
>         at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: gap between start index 375 and 
> first entry to append 377
>         at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
>         at 
> org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
>         at 
> org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
>         at 
> org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
>         at 
> org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
>         at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
>         ... 4 more 
> {code}
> # The gap server directory 
> {code:java}
> $ ll 
> /home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
> total 4120
> -rw-rw-r-- 1 work work   14567 Sep  8 18:30 log_291-374
> -rw-rw-r-- 1 work work 4194304 Sep  8 18:30 log_inprogress_375
> -rw-rw-r-- 1 work work      50 Sep  8 18:30 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 17:29 raft-meta.conf 
> {code}
>  
> The other two
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work      95 Sep  8 12:13 log_0-0
> -rw-rw-r-- 1 work work   39285 Sep  8 17:30 log_1-290
> -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
> -rw-rw-r-- 1 work work     271 Sep  8 17:50 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
> {code}
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work      95 Sep  8 13:15 log_0-0
> -rw-rw-r-- 1 work work   39285 Sep  8 17:29 log_1-290
> -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
> -rw-rw-r-- 1 work work     271 Sep  8 18:29 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
> {code}
>  
> Related Configurations:
> {code:java}
> <property>
> <name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
> <value>false</value>
> </property>
> <property>
> <name>hdds.ratis.raft.server.log.async-flush.enabled</name>
> <value>false</value>
> </property> 
> {code}
>  
> The scene in which the GAP occurs this time is as follows:
> 1. shutdown scm, shutdown more than 60s timeout kill -9
> 2. restart scm, this error occurs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to