[jira] [Created] (RATIS-856) Install Snapshot notifications should be fanned out

2020-04-17 Thread Hanisha Koneru (Jira)
Hanisha Koneru created RATIS-856:


 Summary: Install Snapshot notifications should be fanned out
 Key: RATIS-856
 URL: https://issues.apache.org/jira/browse/RATIS-856
 Project: Ratis
  Issue Type: Improvement
Reporter: Hanisha Koneru
Assignee: Hanisha Koneru


When InstallSnapshot is disabled and Ratis logs are purged, the Leader sends an 
InstallSnapshot notification to the Follower. The Follower then tells its State 
Machine to install the snapshot.

We should give the Follower State Machine time to download and install the 
snapshot. So instead of sending an installSnapshot notification with every 
heartbeat, it would be better to enforce a time gap between notifications.
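
A minimal sketch of one way to enforce such a gap, assuming a hypothetical 
throttle class and a configurable interval; the names below are illustrative, 
not the actual Ratis API:
{code:java}
// Illustrative sketch only; class, field and method names are hypothetical.
public class SnapshotNotificationThrottle {
  private final long minGapMs;          // configured minimum gap between notifications
  private long lastNotificationTimeMs;  // when the last notification was sent

  public SnapshotNotificationThrottle(long minGapMs) {
    this.minGapMs = minGapMs;
  }

  /** @return true if enough time has passed to send another notification. */
  public synchronized boolean shouldNotify(long nowMs) {
    if (nowMs - lastNotificationTimeMs < minGapMs) {
      return false;  // skip: the follower may still be downloading/installing
    }
    lastNotificationTimeMs = nowMs;
    return true;
  }
}
{code}
The leader-side appender would then consult shouldNotify(System.currentTimeMillis()) 
before attaching the notification to a heartbeat, instead of notifying on every 
heartbeat.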



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-850) Allow log purge up to snapshot index

2020-04-17 Thread Hanisha Koneru (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085970#comment-17085970
 ] 

Hanisha Koneru commented on RATIS-850:
--

Unit tests pass locally.


[~msingh], [~lokeshjain], can you please take a look? Thanks.

> Allow log purge up to snapshot index
> 
>
> Key: RATIS-850
> URL: https://issues.apache.org/jira/browse/RATIS-850
> Project: Ratis
>  Issue Type: Improvement
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
> Attachments: RATIS-850.001.patch, RATIS-850.002.patch
>
>
> Ratis logs are purged only up to the least commit index across all the peers. 
> But if one peer is down, it stops log purging on all the peers. If the Ratis 
> server takes snapshots, then we can purge logs up to the snapshot index even 
> if some peer has not committed up to that index. When the peer rejoins the 
> ring, it can catch up from the snapshot instead of the Ratis logs.
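
A small sketch of the proposed purge rule with made-up names; this illustrates 
the idea above and is not the actual Ratis purge code:
{code:java}
// Hypothetical helper illustrating the proposed purge rule.
final class PurgeIndexSketch {
  static long computePurgeUpToIndex(long leastCommitIndexOfAllPeers,
                                    long snapshotIndex,
                                    boolean purgeUpToSnapshotEnabled) {
    // Current rule: never purge beyond what every peer has committed.
    long purgeIndex = leastCommitIndexOfAllPeers;
    // Proposed rule: the log up to the snapshot index can be recovered from the
    // snapshot, so it is safe to purge up to that index even if a down peer has
    // not committed that far.
    if (purgeUpToSnapshotEnabled) {
      purgeIndex = Math.max(purgeIndex, snapshotIndex);
    }
    return purgeIndex;
  }
}
{code}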



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-835) Include exception based attempt count in raft client request

2020-04-17 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085904#comment-17085904
 ] 

Tsz-wo Sze commented on RATIS-835:
--

The 004 patch looks good.  Just a minor comment:
- In RaftClientImpl.sendRequestWithRetry, the dummy pending local variable is 
not needed.  Just use 1 directly:
{code}
  final int exceptionCount = ioe != null ? 1 : 0;
  final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, ioe);
{code}


> Include exception based attempt count in raft client request
> 
>
> Key: RATIS-835
> URL: https://issues.apache.org/jira/browse/RATIS-835
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-835.001.patch, RATIS-835.002.patch, 
> RATIS-835.003.patch, RATIS-835.004.patch
>
>
> The client needs to maintain an exception-based attempt count in order to use 
> an exception-dependent retry policy. An exception-dependent policy makes it 
> possible to specify individual policies for different exception types.
> Currently a policy takes the number of attempts as an argument. Therefore the 
> individual policies require the attempt count for the particular exception 
> while handling a retry event. This is particularly important for the 
> MultipleLinearRandomRetry policy, which increases the sleep interval based on 
> the number of attempts made by the client. The Raft client can therefore use 
> this policy for ResourceUnavailableException and increase the sleep interval 
> for subsequent retries of the request on the same exception.
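
To make the idea concrete, here is a self-contained sketch of exception-based 
attempt counting; the class and method names are hypothetical and this is not 
the Ratis client API:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: track attempt counts per exception type so that a
// per-exception policy (e.g. multiple-linear-random backoff for
// ResourceUnavailableException) can pick its sleep interval.
public class ExceptionAttemptCounter {
  private final Map<Class<? extends Throwable>, AtomicInteger> counts =
      new ConcurrentHashMap<>();

  /** Increment and return the attempt count for this exception type. */
  public int onException(Throwable t) {
    return counts.computeIfAbsent(t.getClass(), c -> new AtomicInteger())
        .incrementAndGet();
  }

  /** Reset after a successful reply so backoff starts over. */
  public void onSuccess() {
    counts.clear();
  }
}
{code}
A policy such as MultipleLinearRandomRetry would then consult the count returned 
by onException for the specific exception, rather than the overall attempt count 
of the request.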



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (RATIS-737) Release Ratis 0.3.0 Thirdparty

2020-04-17 Thread Mukul Kumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh resolved RATIS-737.
-
Fix Version/s: 0.3.0
   Resolution: Fixed

This is already released, resolving this.

> Release Ratis 0.3.0 Thirdparty
> --
>
> Key: RATIS-737
> URL: https://issues.apache.org/jira/browse/RATIS-737
> Project: Ratis
>  Issue Type: Bug
>  Components: thirdparty
>Reporter: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-855) Release Ratis 0.4.0 Thirdparty

2020-04-17 Thread Mukul Kumar Singh (Jira)
Mukul Kumar Singh created RATIS-855:
---

 Summary: Release Ratis 0.4.0 Thirdparty
 Key: RATIS-855
 URL: https://issues.apache.org/jira/browse/RATIS-855
 Project: Ratis
  Issue Type: Bug
  Components: thirdparty
Reporter: Mukul Kumar Singh
Assignee: Mukul Kumar Singh


With RATIS-852 and RATIS-847 resolved, release Ratis Thirdparty 0.4.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-840:
-
Description: 
*What's the problem?*

 After running hadoop-ozone for 4 days, the datanode leaks memory. A heap dump 
shows 460710 instances of GrpcLogAppender, but there are only 6 instances of 
SenderList, and each SenderList contains only 1-2 instances of GrpcLogAppender. 
There are also a lot of logs related to 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
 {code:java}INFO impl.RaftServerImpl: 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting 
GrpcLogAppender for 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
 

 So there are many GrpcLogAppender instances that did not stop their Daemon 
Thread when they were removed from senders. 

 !image-2020-04-06-14-27-28-485.png! 

 !image-2020-04-06-14-27-39-582.png! 
 
*Why is 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
 called so many times?*
1. As the image shows, when a group is removed, the SegmentedRaftLog is closed, 
and the GrpcLogAppender throws an exception when it finds that the 
SegmentedRaftLog has been closed. The GrpcLogAppender is then 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 the new GrpcLogAppender throws the same exception because the SegmentedRaftLog 
is still closed, and it is restarted again, and so on. This results in an 
infinite restart loop of GrpcLogAppender.
2. Actually, when the group is removed, the GrpcLogAppender should be stopped: 
RaftServerImpl::shutdown -> 
[RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
 -> LeaderState::stop -> LogAppender::stopAppender; then the SegmentedRaftLog is 
closed: RaftServerImpl::shutdown -> 
[ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
 ... . Although RoleInfo::shutdownLeaderState is called before ServerState:close, 
the GrpcLogAppender is stopped asynchronously. So the infinite restart of 
GrpcLogAppender happens whenever a GrpcLogAppender stops after the 
SegmentedRaftLog has been closed (see the sketch below).
 !screenshot-1.png! 
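
One way to picture the race is a guard that re-checks whether the server is 
still running before restarting the appender; the sketch below is only an 
illustration with hypothetical names, not the actual fix or Ratis code:
{code:java}
// Illustrative sketch (hypothetical names, not Ratis code): re-check whether
// shutdown has started before restarting an appender, so a closed RaftLog does
// not trigger an endless restart loop.
public class AppenderRestarter {
  private volatile boolean running = true;  // flipped by shutdown()

  public void shutdown() {
    running = false;
  }

  public void restartAppender(Runnable startNewAppender) {
    if (!running) {
      return;  // group removal in progress; let the appender die quietly
    }
    startNewAppender.run();
  }
}
{code}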

h1. {color:#DE350B}Why did GrpcLogAppender not stop the Daemon Thread when 
removed from senders?{color}
h1. {color:#DE350B}Still working on this.{color}
I need to find where the many old GrpcLogAppender threads are blocked. When a 
new GrpcLogAppender thread is 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 it means the old GrpcLogAppender thread has already exited 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than stay blocked.

*Can the new GrpcLogAppender work normally ?*
1. Even without the above problem, the newly created GrpcLogAppender still 
cannot work normally. 
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: 
LeaderState::addAndStartSenders -> 
LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the 
follower responds SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
 -> 
[voterLists.get(0) | 
https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
 {color:#DE350B}The error happens because voterLists.get(0) returns the 
FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new 
GrpcLogAppender.{color}
5. Because the majority commit is computed from the FollowerInfo of the old 
GrpcLogAppender, it never changes. So even though the follower appends entries 
successfully, the leader cannot advance the commit index, and the newly created 
GrpcLogAppender can never work normally (see the sketch below).
6. The unit test runTestRestartLogAppender passes only because it does not stop 
the old GrpcLogAppender, and it is the old GrpcLogAppender, not the new one, 
that appends entries to the follower. If the old GrpcLogAppender were stopped, 
runTestRestartLogAppender would fail.
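
The effect of the stale FollowerInfo on the majority commit can be seen in a toy 
calculation; the numbers and names below are hypothetical and this is not the 
LeaderState code:
{code:java}
import java.util.Arrays;

// Toy illustration: the leader advances the commit index to the highest index
// replicated on a majority of peers (the median of the match indices for an
// odd-sized group). If one entry comes from a stale FollowerInfo that never
// advances, the computed majority index stays stuck even though the real
// follower keeps appending successfully.
public class MajorityIndexDemo {
  static long majorityIndex(long[] matchIndices) {
    long[] sorted = matchIndices.clone();
    Arrays.sort(sorted);
    return sorted[(sorted.length - 1) / 2];  // median = majority-replicated index
  }

  public static void main(String[] args) {
    // 3-peer group: leader at 10, live follower at 10, third peer at 3.
    System.out.println(majorityIndex(new long[] {10, 10, 3}));  // prints 10
    // Same group, but the live follower is tracked by a stale FollowerInfo
    // stuck at 3: the commit index cannot move past 3.
    System.out.println(majorityIndex(new long[] {10, 3, 3}));   // prints 3
  }
}
{code}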


  was:

[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085463#comment-17085463
 ] 

runzhiwang edited comment on RATIS-840 at 4/17/20, 6:10 AM:


Please wait for me. I need to find where a lot of  old GrpcLogAppend threads 
were blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.


was (Author: yjxxtd):
Please wait for me. I need to find where a lot of  old GrpcLogAppend threads 
were blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> h1. *Why GrpcLogAppender did not stop the Daemon Thread when removed from 
> senders ?*
> h1. {color:#DE350B}Still working. {color}
> I need to find where the GrpcLogAppend thread was blocked. Because when 
> [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
>  new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
> the 
> 

[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085463#comment-17085463
 ] 

runzhiwang edited comment on RATIS-840 at 4/17/20, 6:10 AM:


Please wait for me. I need to find where a lot of  old GrpcLogAppend threads 
were blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.


was (Author: yjxxtd):
Please wait for me. I need to find where a lot of  GrpcLogAppend threads were 
blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> h1. *Why GrpcLogAppender did not stop the Daemon Thread when removed from 
> senders ?*
> h1. {color:#DE350B}Still working. {color}
> I need to find where the GrpcLogAppend thread was blocked. Because when 
> [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
>  new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
> the 
> 

[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085463#comment-17085463
 ] 

runzhiwang commented on RATIS-840:
--

Please wait for me. I need to find where a lot of  GrpcLogAppend threads was 
blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> h1. *Why GrpcLogAppender did not stop the Daemon Thread when removed from 
> senders ?*
> h1. {color:#DE350B}Still working. {color}
> I need to find where the GrpcLogAppend thread was blocked. Because when 
> [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
>  new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
> the 
> [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
>  So the old GrpcLogAppender thread should then stop rather than blocked.
> *Can the new GrpcLogAppender work normally ?*
> 1. Even though without the above problem, the new created GrpcLogAppender 
> still can not work normally. 
> 2. When creat a new GrpcLogAppender, a new FollowerInfo will also be created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the new created GrpcLogAppender append entry to 

[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085463#comment-17085463
 ] 

runzhiwang edited comment on RATIS-840 at 4/17/20, 6:09 AM:


Please wait for me. I need to find where a lot of  GrpcLogAppend threads 
wereblocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.


was (Author: yjxxtd):
Please wait for me. I need to find where a lot of  GrpcLogAppend threads was 
blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> h1. *Why GrpcLogAppender did not stop the Daemon Thread when removed from 
> senders ?*
> h1. {color:#DE350B}Still working. {color}
> I need to find where the GrpcLogAppend thread was blocked. Because when 
> [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
>  new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
> the 
> [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
>  

[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-17 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085463#comment-17085463
 ] 

runzhiwang edited comment on RATIS-840 at 4/17/20, 6:09 AM:


Please wait for me. I need to find where a lot of  GrpcLogAppend threads were 
blocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.


was (Author: yjxxtd):
Please wait for me. I need to find where a lot of  GrpcLogAppend threads 
wereblocked. Because when 
[restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
 new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
the 
[runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
 So the old GrpcLogAppender thread should then stop rather than blocked.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-1.png! 
> h1. *Why GrpcLogAppender did not stop the Daemon Thread when removed from 
> senders ?*
> h1. {color:#DE350B}Still working. {color}
> I need to find where the GrpcLogAppend thread was blocked. Because when 
> [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]
>  new GrpcLogAppend thread , it means the old GrpcLogAppend thread has existed 
> the 
> [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77].
>  
