[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-840:
-----------------------------
    Description: 
*What's the problem ?*

 When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
found there are 460710 instances of GrpcLogAppender. But there are only 6 
instances of SenderList, and each SenderList contains 1-2 instance of 
GrpcLogAppender. And there are a lot of logs related to 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
 {code:java}INFO impl.RaftServerImpl: 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting 
GrpcLogAppender for 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
 

 So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
removed from senders. 

 !image-2020-04-06-14-27-28-485.png! 

 !image-2020-04-06-14-27-39-582.png! 
 
*Why 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
 so many times ?*
1. As the image shows, when remove group, SegmentedRaftLog will close, then 
GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then 
GrpcLogAppender will be 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 and the new GrpcLogAppender throw exception again when find the 
SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . 
It results in an infinite restart of GrpcLogAppender.
2. Actually, when remove group, GrpcLogAppender will be stoped: 
RaftServerImpl::shutdown -> 
[RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
 -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will 
be closed:  RaftServerImpl::shutdown -> 
[ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
 ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close.
 !screenshot-1.png! 

*Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?*
{color:#DE350B}Still working. The previous patch has some problem, and I will 
submit it again.{color}

*Can the new GrpcLogAppender work normally ?*
1. Even though without the above problem, the new created GrpcLogAppender still 
can not work normally. 
2. When the new created GrpcLogAppender append entry to follower, then the 
follower response SUCCESS.
3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
 -> 
[voterLists.get(0) | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607],
 error happens. Because voterLists.get(0) return the FollowerInfo of the old 
GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: 
LeaderState::addAndStartSenders -> 
LeaderState::addSenders->RaftServerImpl::newLogAppender, [new 
FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
4. Because the majority commit got from the FollowerInfo of the old 
GrpcLogAppender never changes. So even though follower has append entry 
successfully, the leader can not update commit. So the new created 
GrpcLogAppender can never work normally.
5. The reason of unit test of runTestRestartLogAppender can pass is that it did 
not stop the old GrpcLogAppender, and  the old GrpcLogAppender append entry to 
follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, 
runTestRestartLogAppender will fail.


  was:
*What's the problem ?*

 When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
found there are 460710 instances of GrpcLogAppender. But there are only 6 
instances of SenderList, and each SenderList contains 1-2 instance of 
GrpcLogAppender. And there are a lot of logs related to 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
 {code:java}INFO impl.RaftServerImpl: 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting 
GrpcLogAppender for 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
 

 So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
removed from senders. 

 !image-2020-04-06-14-27-28-485.png! 

 !image-2020-04-06-14-27-39-582.png! 
 
*Why 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
 so many times ?*
1. As the image shows, when remove group, SegmentedRaftLog will close, then 
GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then 
GrpcLogAppender will be 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 and the new GrpcLogAppender throw exception again when find the 
SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . 
It results in an infinite restart of GrpcLogAppender.
2. Actually, when remove group, GrpcLogAppender will be stoped: 
RaftServerImpl::shutdown -> 
[RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
 -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will 
be closed:  RaftServerImpl::shutdown -> 
[ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
 ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close.
 !screenshot-1.png! 

*Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?*
{color:#DE350B}Still working. The previous patch has some problem, and I will 
submit it again.{color}

*Can the new GrpcLogAppender work normally ?*
1. Even though without the above problem, the new created GrpcLogAppender still 
can not work normally. 
2. When the new created GrpcLogAppender append entry to follower, then the 
follower response SUCCESS.
3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
 -> 
[voterLists.get(0) | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607],
 error happens. Because voterLists.get(0) return the FollowerInfo of the old 
GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: 
LeaderState::addAndStartSenders -> 
LeaderState::addSenders->RaftServerImpl::newLogAppender, [new 
FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
4. Because the majority commit got from the FollowerInfo of the old 
GrpcLogAppender never changes. So even though follower has append entry 
successfully, the leader can not update commit. So the new created 
GrpcLogAppender can never work normally.
5. The reason of unit test of runTestRestartLogAppender can pass is that it did 
not stop the old GrpcLogAppender, and  the old GrpcLogAppender append entry to 
follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, 
runTestRestartLogAppender will fail.



> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders 
> ?*
> {color:#DE350B}Still working. The previous patch has some problem, and I will 
> submit it again.{color}
> *Can the new GrpcLogAppender work normally ?*
> 1. Even though without the above problem, the new created GrpcLogAppender 
> still can not work normally. 
> 2. When the new created GrpcLogAppender append entry to follower, then the 
> follower response SUCCESS.
> 3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607],
>  error happens. Because voterLists.get(0) return the FollowerInfo of the old 
> GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
> The new GrpcLogAppender created a new FollowerInfo: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders->RaftServerImpl::newLogAppender, [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 4. Because the majority commit got from the FollowerInfo of the old 
> GrpcLogAppender never changes. So even though follower has append entry 
> successfully, the leader can not update commit. So the new created 
> GrpcLogAppender can never work normally.
> 5. The reason of unit test of runTestRestartLogAppender can pass is that it 
> did not stop the old GrpcLogAppender, and  the old GrpcLogAppender append 
> entry to follower, not the new GrpcLogAppender. If stop the old 
> GrpcLogAppender, runTestRestartLogAppender will fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to