[jira] [Commented] (RATIS-2192) Lots of errors after applying RATIS-2129

Tsz-wo Sze (Jira) Thu, 08 May 2025 12:04:21 -0700


    [ 
https://issues.apache.org/jira/browse/RATIS-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950357#comment-17950357
 ]


Tsz-wo Sze commented on RATIS-2192:
-----------------------------------

BTW, there is a discussion on the dev@ mailing list about moving the (incomplete) 
zero-copy feature to a development branch:
- https://lists.apache.org/thread/y2y7dyff1mmctgbo84lsytwg8bdzsq35

At the same time, we propose to rework RATIS-2129.

> Lots of errors after applying RATIS-2129
> ----------------------------------------
>
>                 Key: RATIS-2192
>                 URL: https://issues.apache.org/jira/browse/RATIS-2192
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Wei-Chiu Chuang
>            Priority: Blocker
>         Attachments: ozone-datanode.1.tgz, ozone-datanode.2.tgz, 
> ozone-datanode.3.tgz
>
>
> To be honest, I am not sure whether this is related to RATIS-2129, but I'm 
> using a build that is Ratis 3.1.1 + RATIS-2129, and I am seeing all kinds of 
> errors running HBase on Ozone.
> Failed to take snapshot because the last applied txn is not current:
> {noformat}
> 2024-11-16 00:10:31,035 INFO 
> [grpc-default-executor-22]-org.apache.ratis.server.RaftServer: 
> e693615a-d484-4165-8446-dff08cac5978: remove  FOLLOWER 
> e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1:t229, 
> leader=67eefe63-0930-42d7-a364-e46fde563ff1, 
> voted=67eefe63-0930-42d7-a364-e46fde563ff1, 
> raftlog=Memoized:e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-SegmentedRaftLog:OPENED:c1613342:last(t:229,
>  i:1613343), conf=conf: {index: 1613340, 
> cur=peers:[e693615a-d484-4165-8446-dff08cac5978|10.140.146.67:9856, 
> 67eefe63-0930-42d7-a364-e46fde563ff1|10.140.86.199:9856, 
> 7cc563b3-14b5-4334-820b-5c3bbecffad8|10.140.20.0:9856]|listeners:[], 
> old=null} RUNNING
> 2024-11-16 00:10:31,038 INFO 
> [grpc-default-executor-22]-org.apache.ratis.server.RaftServer$Division: 
> e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1: shutdown
> 2024-11-16 00:10:31,039 INFO 
> [grpc-default-executor-22]-org.apache.ratis.util.JmxRegister: Successfully 
> un-registered JMX Bean with object name 
> Ratis:service=RaftServer,group=group-AF4CEBD817A1,id=e693615a-d484-4165-8446-dff08cac5978
> 2024-11-16 00:10:31,039 INFO 
> [grpc-default-executor-22]-org.apache.ratis.server.impl.RoleInfo: 
> e693615a-d484-4165-8446-dff08cac5978: shutdown 
> e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-FollowerState
> 2024-11-16 00:10:31,039 INFO 
> [grpc-default-executor-22]-org.apache.ratis.server.impl.StateMachineUpdater: 
> e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-StateMachineUpdater: 
> set stopIndex = 1613342
> 2024-11-16 00:10:31,039 INFO 
> [e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-FollowerState]-org.apache.ratis.server.impl.FollowerState:
>  e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-FollowerState was 
> interrupted
> 2024-11-16 00:10:31,043 ERROR 
> [e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-StateMachineUpdater]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine:
>  Failed to take snapshot  for group-AF4CEBD817A1 as the stateMachine is 
> unhealthy. The last applied index is at (t:216, i:1613313)
> 2024-11-16 00:10:31,043 ERROR 
> [e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-StateMachineUpdater]-org.apache.ratis.server.impl.StateMachineUpdater:
>  e693615a-d484-4165-8446-dff08cac5978@group-AF4CEBD817A1-StateMachineUpdater: 
> Failed to take snapshot
> org.apache.ratis.protocol.exceptions.StateMachineException: Failed to take 
> snapshot  for group-AF4CEBD817A1 as the stateMachine is unhealthy. The last 
> applied index is at (t:216, i:1613313)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.takeSnapshot(ContainerStateMachine.java:356)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.takeSnapshot(StateMachineUpdater.java:286)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.checkAndTakeSnapshot(StateMachineUpdater.java:278)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:194)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Log entry not found:
> {noformat}
> 2024-11-14 01:59:37,516 WARN 
> [7cc563b3-14b5-4334-820b-5c3bbecffad8@group-0C8C280DCAED->67eefe63-0930-42d7-a364-e46fde563ff1-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.server.leader.LogAppenderDaemon: 
> 7cc563b3-14b5-4334-820b-5c3bbecffad8@group-0C8C280DCAED->67eefe63-0930-42d7-a364-e46fde563ff1-GrpcLogAppender-LogAppenderDaemon 
> failed
> org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: 
> index = 3205
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:301)
>         at 
> org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:240)
>         at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:387)
>         at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:262)
>         at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> HDDS-11720 seems to be related too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
