[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment:     (was: RATIS-840.004.patch)

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png,
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> *What's the problem?*
> After running hadoop-ozone for 4 days, the datanode leaks memory. A heap dump
> shows 460710 instances of GrpcLogAppender, but only 6 instances of SenderList,
> and each SenderList contains 1-2 instances of GrpcLogAppender. There are also
> many log entries related to
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
> {code:java}INFO impl.RaftServerImpl:
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState:
> Restarting GrpcLogAppender for
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>
> So many GrpcLogAppender instances did not stop their Daemon thread when they
> were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
> called so many times?*
> 1. As the image shows, when a group is removed, the SegmentedRaftLog is
> closed, and the GrpcLogAppender throws an exception when it finds the
> SegmentedRaftLog closed. The GrpcLogAppender is then
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
> and the new GrpcLogAppender throws the same exception because the
> SegmentedRaftLog is still closed, so it is restarted again, and so on. The
> result is an infinite restart of GrpcLogAppender.
> 2. Actually, when a group is removed, the GrpcLogAppender is stopped:
> RaftServerImpl::shutdown ->
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
> -> LeaderState::stop -> LogAppender::stopAppender, and then the
> SegmentedRaftLog is closed: RaftServerImpl::shutdown ->
> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
> ... . Although RoleInfo::shutdownLeaderState is called before
> ServerState::close, the GrpcLogAppender is stopped asynchronously. So the
> infinite restart happens when the GrpcLogAppender stops after the
> SegmentedRaftLog has been closed.
> !screenshot-1.png!
> *Why did GrpcLogAppender not stop the Daemon thread when removed from senders?*
> I found many GrpcLogAppender threads blocked inside log4j. I think the
> GrpcLogAppender restarts too fast and then blocks in log4j.
> !screenshot-2.png!
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still
> cannot work normally.
> 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created:
> LeaderState::addAndStartSenders -> LeaderState::addSenders ->
> RaftServerImpl::newLogAppender -> [new
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower,
> the follower responds with SUCCESS.
> 4. Then LeaderState::updateCommit ->
> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
> ->
> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
> {color:#DE350B}The error happens because voterLists.get(0) returns the
> FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new
> GrpcLogAppender.{color}
> 5. The majority commit index computed from the FollowerInfo of the old
> GrpcLogAppender never changes. So even though the follower has appended the
> entry successfully, the leader cannot update its commit index, and the newly
> created GrpcLogAppender can never work normally.
> 6. The reason the unit test runTestRestartLogAppender can pass is that it
> did not
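
The stale-FollowerInfo effect described in steps 2-5 of the last section can be illustrated with a small, self-contained sketch. This is not Ratis code: `FollowerInfoSketch`, `VoterListSketch`, `replace` and `majorityMin` are hypothetical names invented for the illustration, and the median rule is only assumed to resemble what LeaderState::getMajorityMin does. It shows that a commit index computed from the old object stays put until the voter list is pointed at the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FollowerInfo: the highest log index known to be
// replicated on one follower.
final class FollowerInfoSketch {
  private final String peerId;
  private long matchIndex;

  FollowerInfoSketch(String peerId, long matchIndex) {
    this.peerId = peerId;
    this.matchIndex = matchIndex;
  }

  String peerId() { return peerId; }

  long matchIndex() { return matchIndex; }

  // Match indices only move forward.
  void updateMatchIndex(long index) { matchIndex = Math.max(matchIndex, index); }
}

// Hypothetical stand-in for the list of voters that the leader consults.
final class VoterListSketch {
  private final List<FollowerInfoSketch> voters = new ArrayList<>();

  void add(FollowerInfoSketch info) { voters.add(info); }

  // Swap in the FollowerInfo that belongs to a restarted appender.
  boolean replace(FollowerInfoSketch fresh) {
    for (int i = 0; i < voters.size(); i++) {
      if (voters.get(i).peerId().equals(fresh.peerId())) {
        voters.set(i, fresh);
        return true;
      }
    }
    return false;
  }

  // Highest index replicated on a majority, counting the leader itself.
  long majorityMin(long leaderIndex) {
    long[] indices = new long[voters.size() + 1];
    for (int i = 0; i < voters.size(); i++) {
      indices[i] = voters.get(i).matchIndex();
    }
    indices[voters.size()] = leaderIndex;
    Arrays.sort(indices);
    return indices[(indices.length - 1) / 2];
  }
}
```

With one leader at index 10 and two followers whose old entries sit at index 3, a restarted appender that advances only its own new `FollowerInfoSketch` leaves `majorityMin(10)` at 3. Once `replace` installs the new object, the same call returns 10.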
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.004.patch

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> RATIS-840.003.patch, RATIS-840.004.patch, image-2020-04-06-14-27-28-485.png,
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Elek updated RATIS-840:
------------------------------
    Priority: Blocker  (was: Critical)

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png,
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Summary: Memory leak of LogAppender  (was: Use new FollowerInfo in votesList when create new GrpcLogAppender)

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png,
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.003.patch

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png,
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.002.patch

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch,
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png,
> screenshot-1.png, screenshot-2.png
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.001.patch

> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>         Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
> After running hadoop-ozone for 4 days, the datanode leaks memory. A heap dump shows 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, and each SenderList contains 1-2 instances of GrpcLogAppender. There are also a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
> {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>
> So a lot of GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog was closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, and it is restarted again, and so on. The result is an infinite restart of GrpcLogAppender.
> 2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. Then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender stops after SegmentedRaftLog is closed.
> !screenshot-1.png!
>
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
> I found a lot of GrpcLogAppender threads blocked inside log4j. I think GrpcLogAppender restarts too fast and then blocks in log4j.
> !screenshot-2.png!
>
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
> 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
> 5. Because the majority commit is computed from the FollowerInfo of the old GrpcLogAppender, it never changes. So even though the follower has appended the entry successfully, the leader cannot update the commit index, and the newly created GrpcLogAppender can never work normally.
> 6. The unit test runTestRestartLogAppender can pass because it did not
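
The stale-FollowerInfo effect described in points 2 to 5 can be illustrated with a small standalone sketch. This is not Ratis code: the class `StaleFollowerInfoSketch`, the type `FollowerState`, and the methods `majorityMin` and `commitAfterRestart` are hypothetical stand-ins, and the index values are made up for the illustration. It only assumes that the commit index is the highest index stored on a majority of peers and that a restart creates a new per-follower state object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleFollowerInfoSketch {
    /** Hypothetical stand-in for FollowerInfo: the highest log index a follower has acknowledged. */
    public static final class FollowerState {
        private final String peerId;
        private long matchIndex;

        public FollowerState(String peerId, long matchIndex) {
            this.peerId = peerId;
            this.matchIndex = matchIndex;
        }

        public void onAppendSuccess(long index) {
            matchIndex = Math.max(matchIndex, index);
        }

        public long getMatchIndex() {
            return matchIndex;
        }
    }

    /** Highest index that is stored on a majority of the peers (followers plus leader). */
    public static long majorityMin(List<FollowerState> voters, long leaderIndex) {
        long[] indices = new long[voters.size() + 1];
        for (int i = 0; i < voters.size(); i++) {
            indices[i] = voters.get(i).getMatchIndex();
        }
        indices[voters.size()] = leaderIndex;
        Arrays.sort(indices);
        // With n peers sorted ascending, the entry at (n - 1) / 2 is reached by a majority.
        return indices[(indices.length - 1) / 2];
    }

    /** Simulates restarting the appender for follower f1 in a three-peer group. */
    public static long commitAfterRestart(boolean replaceInVoterList) {
        long leaderIndex = 10;
        FollowerState oldF1 = new FollowerState("f1", 5);
        FollowerState f2 = new FollowerState("f2", 5);
        List<FollowerState> voters = new ArrayList<>(Arrays.asList(oldF1, f2));

        // The restart creates a fresh state object for the same peer.
        FollowerState newF1 = new FollowerState("f1", 5);
        if (replaceInVoterList) {
            voters.set(0, newF1);
        }

        // The follower acknowledges entries up to index 10; only the new object sees it.
        newF1.onAppendSuccess(10);
        return majorityMin(voters, leaderIndex);
    }

    public static void main(String[] args) {
        System.out.println("voter list keeps old state: " + commitAfterRestart(false));
        System.out.println("voter list uses new state:  " + commitAfterRestart(true));
    }
}
```

In this sketch the commit index stays at 5 when the voter list still holds the old object and advances to 10 once the list refers to the new one, which mirrors the behavior the report attributes to voterLists.get(0).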
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: pom.xml
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment:     (was: RATIS-840.001.patch)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment:     (was: pom.xml)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment:     (was: RATIS-840.001.patch)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.001.patch
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment:     (was: RATIS-840.001.patch)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Attachment: RATIS-840.001.patch
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:

*What's the problem?*

After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].

{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}

So a large number of GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.

!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*

1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender then throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. In fact, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. SegmentedRaftLog is closed afterwards: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender actually stops after SegmentedRaftLog has been closed.

!screenshot-1.png!

*Why did GrpcLogAppender not stop its daemon thread when removed from senders?*

I found many GrpcLogAppender threads blocked inside log4j. I think GrpcLogAppender restarts so fast that the threads then block in log4j.

!screenshot-2.png!

*Can the new GrpcLogAppender work normally?*

1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
5. Because the majority commit obtained from the FollowerInfo of the old GrpcLogAppender never changes, the leader cannot update the commit even though the follower has appended the entry successfully. So the newly created GrpcLogAppender can never work normally.
6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so it is the old GrpcLogAppender, not the new one, that appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:

*What's the problem?*

After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to
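The stale-FollowerInfo problem described in items 2 to 5 of the description can be sketched in a few lines. This is a simplified model, not Ratis code: `FollowerProgress`, `CommitTracker`, and `replaceVoter` are hypothetical names standing in for FollowerInfo and the leader's voter list, and replacing the tracked entry is only one possible remedy, not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FollowerInfo: the highest index replicated on one follower.
class FollowerProgress {
  private final String peerId;
  private long matchIndex;

  FollowerProgress(String peerId) { this.peerId = peerId; }
  String peerId() { return peerId; }
  long matchIndex() { return matchIndex; }
  void updateMatchIndex(long index) { matchIndex = Math.max(matchIndex, index); }
}

// Hypothetical stand-in for the leader's voter list and majority computation.
class CommitTracker {
  private final List<FollowerProgress> voters = new ArrayList<>();

  void addVoter(FollowerProgress f) { voters.add(f); }

  // Swaps in the fresh progress object for the same peer; returns true if an entry was replaced.
  boolean replaceVoter(FollowerProgress fresh) {
    for (int i = 0; i < voters.size(); i++) {
      if (voters.get(i).peerId().equals(fresh.peerId())) {
        voters.set(i, fresh);
        return true;
      }
    }
    return false;
  }

  // Highest index stored on a majority of the group (leader plus tracked voters).
  long majorityIndex(long leaderLastIndex) {
    long[] indexes = new long[voters.size() + 1];
    indexes[0] = leaderLastIndex;
    for (int i = 0; i < voters.size(); i++) {
      indexes[i + 1] = voters.get(i).matchIndex();
    }
    Arrays.sort(indexes);
    // In ascending order, the element at (n - 1) / 2 is reached by at least a majority.
    return indexes[(indexes.length - 1) / 2];
  }
}

class StaleFollowerInfoDemo {
  // Returns {commit index with stale entries, commit index after replacing them}.
  static long[] run() {
    CommitTracker tracker = new CommitTracker();
    tracker.addVoter(new FollowerProgress("f1"));
    tracker.addVoter(new FollowerProgress("f2"));

    // Restarted appenders create fresh progress objects that the tracker does not know about.
    FollowerProgress newF1 = new FollowerProgress("f1");
    FollowerProgress newF2 = new FollowerProgress("f2");
    newF1.updateMatchIndex(10);
    newF2.updateMatchIndex(10);

    long stale = tracker.majorityIndex(10); // still reads the old objects
    tracker.replaceVoter(newF1);
    tracker.replaceVoter(newF2);
    long fixed = tracker.majorityIndex(10);
    return new long[] {stale, fixed};
  }

  public static void main(String[] args) {
    long[] r = run();
    System.out.println("stale=" + r[0] + " fixed=" + r[1]);
  }
}
```

In this model the followers acknowledge index 10, yet the commit index stays at 0 until the tracker is pointed at the new progress objects, which mirrors the "leader cannot update the commit" symptom in item 5.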
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:

*What's the problem?*

After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].

{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}

So a large number of GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.

!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*

1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender then throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. In fact, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. SegmentedRaftLog is closed afterwards: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender actually stops after SegmentedRaftLog has been closed.

!screenshot-1.png!

*Why did GrpcLogAppender not stop its daemon thread when removed from senders?*

I found many GrpcLogAppender threads blocked inside LOG.info. I think GrpcLogAppender restarts so fast that the threads then block in LOG.info.

!screenshot-2.png!

*Can the new GrpcLogAppender work normally?*

1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
5. Because the majority commit obtained from the FollowerInfo of the old GrpcLogAppender never changes, the leader cannot update the commit even though the follower has appended the entry successfully. So the newly created GrpcLogAppender can never work normally.
6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so it is the old GrpcLogAppender, not the new one, that appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:

*What's the problem?*

After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to
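The restart loop from item 1 of the restartSender section can be illustrated with a small synchronous model. It is only a sketch under stated assumptions: `ClosableLog`, `RestartLoopDemo`, and the `checkLeaderBeforeRestart` guard are hypothetical, the real appenders run on their own threads, and the `cap` parameter exists only to keep the unguarded loop finite.

```java
// Hypothetical stand-in for SegmentedRaftLog: reading a closed log throws.
class ClosableLog {
  private volatile boolean closed;

  void close() { closed = true; }

  void read() {
    if (closed) {
      throw new IllegalStateException("log is closed");
    }
  }
}

class RestartLoopDemo {
  // Counts how many appenders get created before the loop ends (or the cap is hit).
  // Each "appender" reads the log once; a failure triggers a restart request.
  static int countAppenders(ClosableLog log, boolean leaderRunning,
                            boolean checkLeaderBeforeRestart, int cap) {
    int created = 0;
    boolean restart = true;
    while (restart && created < cap) {
      created++;
      restart = false;
      try {
        log.read();
      } catch (IllegalStateException e) {
        // Unguarded: always restart. Guarded: restart only while the leader is still running.
        restart = !checkLeaderBeforeRestart || leaderRunning;
      }
    }
    return created;
  }

  public static void main(String[] args) {
    ClosableLog log = new ClosableLog();
    log.close();
    System.out.println("unguarded: " + countAppenders(log, false, false, 1000));
    System.out.println("guarded:   " + countAppenders(log, false, true, 1000));
  }
}
```

With the log closed and the leader already shut down, the unguarded variant keeps creating appenders until it reaches the cap, while the guarded variant stops after the first failure.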
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Attachment: screenshot-2.png

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Critical
> Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
> After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
> {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>
> So a large number of GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender then throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
> 2. In fact, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. SegmentedRaftLog is closed afterwards: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender actually stops after SegmentedRaftLog has been closed.
> !screenshot-1.png!
> *Why did GrpcLogAppender not stop its daemon thread when removed from senders?*
> Still investigating.
> I need to find where the many old GrpcLogAppender threads are blocked. When the [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94] creates a new GrpcLogAppender thread, the old GrpcLogAppender thread has already exited [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77], so the old thread should then stop rather than stay blocked.
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
> 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender,
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:

*What's the problem?*

After hadoop-ozone had run for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log messages related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].

{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}

So a large number of GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.

!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*

1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender then throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. In fact, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. SegmentedRaftLog is closed afterwards: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender actually stops after SegmentedRaftLog has been closed.

!screenshot-1.png!

*Why did GrpcLogAppender not stop its daemon thread when removed from senders?*

Still investigating. I need to find where the many old GrpcLogAppender threads are blocked. When the [restart|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94] creates a new GrpcLogAppender thread, the old GrpcLogAppender thread has already exited [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77], so the old thread should then stop rather than stay blocked.

*Can the new GrpcLogAppender work normally?*

1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
5. Because the majority commit obtained from the FollowerInfo of the old GrpcLogAppender never changes, the leader cannot update the commit even though the follower has appended the entry successfully. So the newly created GrpcLogAppender can never work normally.
6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so it is the old GrpcLogAppender, not the new one, that appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:

*What's the problem?*

When running hadoop-ozone for 4
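The ordering issue from item 2 of the restartSender section (the appender is asked to stop, but the log is closed before the appender thread has actually finished) can be reproduced deterministically with two latches. This is a sketch only: `ShutdownOrderDemo` and its flags are hypothetical stand-ins, and waiting for the appender with `Thread.join` before closing the log is shown as one possible ordering, not as the fix chosen for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownOrderDemo {
  // Returns true if the appender thread observed the log as closed during its last iteration.
  static boolean appenderSawClosedLog(boolean waitForAppender) {
    AtomicBoolean running = new AtomicBoolean(true);
    AtomicBoolean logClosed = new AtomicBoolean(false);
    AtomicBoolean sawClosedLog = new AtomicBoolean(false);
    CountDownLatch inIteration = new CountDownLatch(1); // appender has started an iteration
    CountDownLatch mayReadLog = new CountDownLatch(1);   // lets that iteration reach the log

    Thread appender = new Thread(() -> {
      while (running.get()) {
        inIteration.countDown();
        try {
          mayReadLog.await(); // simulates an append that is still in flight
        } catch (InterruptedException e) {
          return;
        }
        if (logClosed.get()) {
          sawClosedLog.set(true); // in the description, this is where the exception is thrown
          return;
        }
      }
    });
    appender.setDaemon(true);
    appender.start();

    try {
      inIteration.await();
      running.set(false); // asynchronous stop: only flips a flag
      if (waitForAppender) {
        mayReadLog.countDown(); // let the in-flight iteration finish
        appender.join();        // wait until the thread has really exited
      }
      logClosed.set(true);      // close the log
      mayReadLog.countDown();
      appender.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sawClosedLog.get();
  }

  public static void main(String[] args) {
    System.out.println("without waiting: " + appenderSawClosedLog(false));
    System.out.println("with waiting:    " + appenderSawClosedLog(true));
  }
}
```

Without the wait, the in-flight iteration runs after the log is closed and observes the closed log; with the wait, the appender thread exits before the log is closed.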
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description:

*What's the problem?*

After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].

{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}

So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.

!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*

1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the SegmentedRaftLog closed, so it is restarted again, and so on. The result is an infinite restart of GrpcLogAppender.
2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. Then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState:close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when the GrpcLogAppender stops only after the SegmentedRaftLog has been closed.

!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*

Still working. I need to find where the many old GrpcLogAppender threads are blocked. When a new GrpcLogAppender thread is [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the old GrpcLogAppender thread has already exited [runAppenderImpl|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L77], so the old thread should stop rather than remain blocked.

*Can the new GrpcLogAppender work normally?*

1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
5. Because the majority commit obtained from the FollowerInfo of the old GrpcLogAppender never changes, the leader cannot update the commit even though the follower has appended the entry successfully. So the newly created GrpcLogAppender can never work normally.
6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so the old GrpcLogAppender, not the new one, appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender will fail.

was: *What's the problem ?* When run hadoop-ozone for 4
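The restart loop described under the restartSender question in this issue (the appender fails on the closed log, is replaced, and the replacement fails the same way) can be modelled in isolation. The following is a minimal sketch, not Ratis code: `RestartLoopSketch`, `ToyLog`, `countRestarts`, and the `leaderRunning` flag are hypothetical names, and the guard shown (skip the restart once the leader is shutting down) is only one possible way to break the loop, not necessarily what the attached patches do.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the restart loop; every name here is hypothetical, not a Ratis API. */
final class RestartLoopSketch {

  /** Stands in for the raft log that is closed when the group is removed. */
  static final class ToyLog {
    private final AtomicBoolean open = new AtomicBoolean(true);

    void close() {
      open.set(false);
    }

    long nextIndex() {
      if (!open.get()) {
        throw new IllegalStateException("log is closed");
      }
      return 0L;
    }
  }

  /**
   * Simulates appenders being started one after another and returns how many
   * restarts were attempted. Without the guard, each replacement appender fails
   * on the closed log and requests yet another restart, so only maxRestarts
   * ends the loop.
   */
  static int countRestarts(ToyLog log, AtomicBoolean leaderRunning,
                           boolean guarded, int maxRestarts) {
    int restarts = 0;
    while (restarts < maxRestarts) {
      try {
        log.nextIndex();   // the appender reads the log
        return restarts;   // healthy appender: no restart needed
      } catch (IllegalStateException e) {
        if (guarded && !leaderRunning.get()) {
          return restarts; // leader is shutting down: do not restart
        }
        restarts++;        // otherwise replace the failed appender
      }
    }
    return restarts;
  }
}
```

With a closed log and a stopped leader, the unguarded variant keeps restarting until the cap is reached, while the guarded variant performs no restart at all.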
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Attachment: RATIS-840.001.patch > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Critical > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png
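Points 4 and 5 of the description (the commit calculation reading the FollowerInfo of the old GrpcLogAppender) can be illustrated with a toy model. This is a sketch under assumptions, not the Ratis implementation: `StaleFollowerInfoSketch`, `ToyFollowerInfo`, and `majorityMin` are hypothetical names, and the majority computation is simplified to picking the majority-replicated value from the leader's index plus the followers' match indices.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of a commit calculation reading stale progress objects (hypothetical names). */
final class StaleFollowerInfoSketch {

  /** Hypothetical stand-in for the per-follower replication progress. */
  static final class ToyFollowerInfo {
    private long matchIndex;

    long getMatchIndex() {
      return matchIndex;
    }

    void updateMatchIndex(long index) {
      matchIndex = Math.max(matchIndex, index);
    }
  }

  /** Returns the highest index that a majority of {leader + voters} has reached. */
  static long majorityMin(long leaderIndex, List<ToyFollowerInfo> voters) {
    long[] indices = new long[voters.size() + 1];
    indices[0] = leaderIndex;
    for (int i = 0; i < voters.size(); i++) {
      indices[i + 1] = voters.get(i).getMatchIndex();
    }
    Arrays.sort(indices);
    // In ascending order, the element at (n - 1) / 2 is reached by a majority.
    return indices[(indices.length - 1) / 2];
  }
}
```

In this toy setup, suppose both followers' appenders were restarted and the voter list still holds the old progress objects. The new objects advance to index 10 after successful appends, yet `majorityMin(10, oldVoters)` stays at 0 because it only reads the old ones; passing the new objects instead yields 10.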
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, and then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when the GrpcLogAppender actually stops only after the SegmentedRaftLog has been closed.
!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
h1. {color:#DE350B}Still working. The previous patch has some problems, and I will submit it again.{color}

*Can the new GrpcLogAppender work normally?*
1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
5. The majority commit index obtained from the FollowerInfo of the old GrpcLogAppender never changes, so even though the follower has appended the entry successfully, the leader cannot update its commit index. The newly created GrpcLogAppender can therefore never work normally.
6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so the old GrpcLogAppender, not the new one, appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO
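The restart loop described under "Why is LeaderState::restartSender called so many times?" can be modelled with a small, single-threaded sketch. Everything here is a toy stand-in: `RestartLoopSketch`, `restartsAfterClose`, and the `checkClosedBeforeRestart` guard are hypothetical names for illustration, not Ratis code and not the patch attached to this issue. The sketch only shows that, once the log is closed, every replacement appender fails the same way unless something checks the closed state before restarting.

```java
/** Toy model of the restart loop; none of these names exist in Ratis. */
final class RestartLoopSketch {

  /**
   * Counts how many replacement appenders get created after the log is closed.
   *
   * @param checkClosedBeforeRestart hypothetical guard: skip the restart if the log is closed
   * @param cap upper bound so the unguarded loop terminates in this sketch
   */
  static int restartsAfterClose(boolean checkClosedBeforeRestart, int cap) {
    final boolean logClosed = true;   // the group was removed, so the log is closed
    boolean appenderFailed = true;    // the first appender has just hit the closed log
    int created = 0;

    while (appenderFailed && created < cap) {
      if (checkClosedBeforeRestart && logClosed) {
        break;                        // guarded variant: do not restart on a closed log
      }
      created++;                      // corresponds to "Restarting GrpcLogAppender for ..."
      appenderFailed = logClosed;     // the new appender reads the closed log and fails too
    }
    return created;
  }

  public static void main(String[] args) {
    System.out.println("unguarded restarts: " + restartsAfterClose(false, 1000));
    System.out.println("guarded restarts:   " + restartsAfterClose(true, 1000));
  }
}
```

Without the guard the loop only ends because of the artificial cap, which matches the unbounded growth in appender instances reported above.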
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Banerjee updated RATIS-840:
Priority: Critical (was: Major)

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Critical
> Attachments: image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem?*
> After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
> {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>
> So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
> 2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, and then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when the GrpcLogAppender actually stops only after the SegmentedRaftLog has been closed.
> !screenshot-1.png!
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
> {color:#DE350B}Still working. The previous patch has some problems, and I will submit it again.{color}
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
> 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color}
> 5. The majority commit index obtained from the FollowerInfo of the old GrpcLogAppender never changes, so even though the follower has appended the entry successfully, the leader cannot update its commit index. The newly created GrpcLogAppender can therefore never work normally.
> 6. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, and the old GrpcLogAppender append
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, and then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when the GrpcLogAppender actually stops only after the SegmentedRaftLog has been closed.
!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
{color:#DE350B}Still working. The previous patch has some problems, and I will submit it again.{color}

*Can the new GrpcLogAppender work normally?*
1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. {color:#DE350B}The error happens because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender.{color} The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
4. The majority commit index obtained from the FollowerInfo of the old GrpcLogAppender never changes, so even though the follower has appended the entry successfully, the leader cannot update its commit index. The newly created GrpcLogAppender can therefore never work normally.
5. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so the old GrpcLogAppender, not the new one, appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl:
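The stale-FollowerInfo problem in the "Can the new GrpcLogAppender work normally?" steps can be illustrated with a minimal sketch. The classes below (`ToyFollowerInfo`, `ToyAppender`) and the `refreshVoterList` switch are hypothetical stand-ins, not the real Ratis types or the fix chosen for this issue; the sketch assumes only what the description states, namely that the restarted appender writes to a new FollowerInfo while the commit calculation still reads the old one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the stale FollowerInfo reference; these are not Ratis classes. */
final class StaleFollowerInfoSketch {

  static final class ToyFollowerInfo {
    long matchIndex;                        // highest index known to be replicated
  }

  static final class ToyAppender {
    private final ToyFollowerInfo follower;

    ToyAppender(ToyFollowerInfo follower) {
      this.follower = follower;
    }

    void onAppendSuccess(long index) {      // follower answered SUCCESS
      follower.matchIndex = index;
    }
  }

  /** Returns the match index the leader sees through its voter list after a restart. */
  static long matchIndexSeenByLeader(boolean refreshVoterList) {
    ToyFollowerInfo oldInfo = new ToyFollowerInfo();
    List<ToyFollowerInfo> voters = new ArrayList<>();
    voters.add(oldInfo);                    // the list the commit calculation reads

    // Restart: the new appender comes with a brand-new FollowerInfo.
    ToyFollowerInfo newInfo = new ToyFollowerInfo();
    ToyAppender restarted = new ToyAppender(newInfo);
    if (refreshVoterList) {
      voters.set(0, newInfo);               // hypothetical remedy: point the list at the new object
    }

    restarted.onAppendSuccess(42L);         // updates newInfo only
    return voters.get(0).matchIndex;        // stale unless the list was refreshed
  }

  public static void main(String[] args) {
    System.out.println("stale list:     " + matchIndexSeenByLeader(false));
    System.out.println("refreshed list: " + matchIndexSeenByLeader(true));
  }
}
```

In the stale case the leader keeps seeing the old object's untouched index, which is why the commit index cannot advance even though the follower acknowledged the append.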
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, and then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. So the infinite restart happens when the GrpcLogAppender actually stops only after the SegmentedRaftLog has been closed.
!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
{color:#DE350B}Still working. The previous patch has some problems, and I will submit it again.{color}

*Can the new GrpcLogAppender work normally?*
1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], where the error happens, because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
4. The majority commit index obtained from the FollowerInfo of the old GrpcLogAppender never changes, so even though the follower has appended the entry successfully, the leader cannot update its commit index. The newly created GrpcLogAppender can therefore never work normally.
5. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so the old GrpcLogAppender, not the new one, appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.

was:
*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl:
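The heap-dump observation (hundreds of thousands of GrpcLogAppender instances but only a handful referenced from SenderList) is consistent with a general JVM fact: a running thread keeps everything it references reachable, even after the owning object is dropped from every list. The sketch below is an illustration of that fact only. `ToyAppender` and `liveThreadsAfterRestarts` are hypothetical names, and the sketch does not claim to reproduce why the real daemon thread was left running, which the description still marks as under investigation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of appenders whose daemon threads outlive their removal; not Ratis code. */
final class LeakedDaemonSketch {

  static final class ToyAppender {
    private volatile boolean running = true;
    private final Thread daemon = new Thread(() -> {
      while (running) {
        try {
          Thread.sleep(1);                 // stands in for the append loop
        } catch (InterruptedException e) {
          return;
        }
      }
    });

    ToyAppender() {
      daemon.setDaemon(true);
      daemon.start();
    }

    void stop() {
      running = false;
      daemon.interrupt();
      try {
        daemon.join();                     // wait until the thread has really exited
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }

    boolean isAlive() {
      return daemon.isAlive();
    }
  }

  /** Restarts the appender several times and counts threads that are still alive. */
  static int liveThreadsAfterRestarts(int restarts, boolean stopOld) {
    List<ToyAppender> everCreated = new ArrayList<>();   // only for counting in this sketch
    ToyAppender current = new ToyAppender();
    everCreated.add(current);
    for (int i = 0; i < restarts; i++) {
      if (stopOld) {
        current.stop();                    // stop the old daemon before replacing it
      }
      current = new ToyAppender();         // the old appender is no longer the "sender"
      everCreated.add(current);
    }
    int alive = 0;
    for (ToyAppender a : everCreated) {
      if (a.isAlive()) {
        alive++;
      }
    }
    for (ToyAppender a : everCreated) {
      a.stop();                            // clean up so the sketch leaves no threads behind
    }
    return alive;
  }

  public static void main(String[] args) {
    System.out.println("alive without stopping old: " + liveThreadsAfterRestarts(5, false));
    System.out.println("alive when stopping old:    " + liveThreadsAfterRestarts(5, true));
  }
}
```

When the old appender is not stopped, every restart adds one more live thread, so the number of retained appender objects grows with the number of restarts rather than with the size of the sender list.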
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
-----------------------------
    Description:

*What's the problem?*

After running hadoop-ozone for 4 days, the datanode leaks memory. A heap dump shows 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, and each SenderList contains 1-2 instances of GrpcLogAppender. There are also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].

{code:java}
INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}
{code}

So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.

!image-2020-04-06-14-27-28-485.png!

!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*

1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], the new GrpcLogAppender throws the same exception when it finds that SegmentedRaftLog is closed, it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. Actually, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. SegmentedRaftLog is closed afterwards: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, GrpcLogAppender is stopped asynchronously. So the infinite restart happens when GrpcLogAppender actually stops after SegmentedRaftLog has been closed.

!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*

{color:#DE350B}Still working on this. The previous patch has some problems, and I will submit it again.{color}

*Can the new GrpcLogAppender work normally?*

1. Even without the above problem, the newly created GrpcLogAppender still cannot work normally.
2. When the newly created GrpcLogAppender appends an entry to the follower, the follower responds with SUCCESS.
3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], and an error happens, because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129].
4. The majority commit is obtained from the FollowerInfo of the old GrpcLogAppender, which never changes. So even though the follower has appended the entry successfully, the leader cannot update the commit, and the newly created GrpcLogAppender can never work normally.
5. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so it is the old GrpcLogAppender, not the new one, that appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.
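
The stale FollowerInfo effect described in steps 3 and 4 can be illustrated with a small standalone sketch. It is not Ratis code: the classes FollowerInfo and Appender and the method majorityMin below are hypothetical stand-ins that share only their names with the real classes, and the majority computation is simplified to one leader plus a flat list of voters. It only shows why a voter list that still references the old FollowerInfo keeps returning the old value after a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StaleFollowerInfoSketch {

    /** Hypothetical stand-in for FollowerInfo: remembers the highest index the follower acknowledged. */
    static final class FollowerInfo {
        private long matchIndex;

        long getMatchIndex() {
            return matchIndex;
        }

        void updateMatchIndex(long index) {
            matchIndex = Math.max(matchIndex, index);
        }
    }

    /** Hypothetical stand-in for a log appender; it only ever updates its own FollowerInfo. */
    static final class Appender {
        final FollowerInfo follower;

        Appender(FollowerInfo follower) {
            this.follower = follower;
        }

        void onAppendSuccess(long index) {
            follower.updateMatchIndex(index);
        }
    }

    /** Simplified majority-min: the highest index stored on a majority of peers, leader included. */
    static long majorityMin(List<FollowerInfo> voters, long leaderIndex) {
        long[] indices = new long[voters.size() + 1];
        for (int i = 0; i < voters.size(); i++) {
            indices[i] = voters.get(i).getMatchIndex();
        }
        indices[voters.size()] = leaderIndex;
        Arrays.sort(indices);
        return indices[(indices.length - 1) / 2];
    }

    /** Returns {commit computed from the stale voter list, commit computed from a refreshed list}. */
    static long[] demo() {
        long leaderIndex = 10;

        // The voter list is built when the first appender starts.
        Appender oldAppender = new Appender(new FollowerInfo());
        oldAppender.onAppendSuccess(5);
        List<FollowerInfo> voters = new ArrayList<>();
        voters.add(oldAppender.follower);

        // Restart: the new appender gets a fresh FollowerInfo, but the voter list is left untouched.
        Appender newAppender = new Appender(new FollowerInfo());
        newAppender.onAppendSuccess(leaderIndex);

        long stale = majorityMin(voters, leaderIndex);
        long refreshed = majorityMin(Arrays.asList(newAppender.follower), leaderIndex);
        return new long[] {stale, refreshed};
    }

    public static void main(String[] args) {
        long[] result = demo();
        System.out.println("stale voter list:     " + result[0]);
        System.out.println("refreshed voter list: " + result[1]);
    }
}
```

Under these assumptions the stale voter list yields 5 even though the follower has acknowledged index 10, while a list that references the new FollowerInfo yields 10. Whether the real fix should reuse the old FollowerInfo or refresh the voter list is left open here.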
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} *Can the new GrpcLogAppender work normally ?* 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then [LeaderState::updateCommit| https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [LeaderState::getMajorityMin | https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599], [voterLists.get(0) | https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders, LeaderState::addSenders->RaftServerImpl::newLogAppender, [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} *Can the new GrpcLogAppender work normally ?* 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599], [voterLists.get(0) | https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders, LeaderState::addSenders->RaftServerImpl::newLogAppender, [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} *Can the new GrpcLogAppender work normally ?* 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599], [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders, LeaderState::addSenders->RaftServerImpl::newLogAppender, [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} *Can the new GrpcLogAppender work normally ?* 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit -\> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599], [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders, LeaderState::addSenders->RaftServerImpl::newLogAppender, [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} *Can the new GrpcLogAppender work normally ?* 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit, [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599], [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders, LeaderState::addSenders->RaftServerImpl::newLogAppender, [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} {color:#DE350B}Can the new GrpcLogAppender work normally ?{color} 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders->RaftServerImpl::newLogAppender->[new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. So the new created GrpcLogAppender can never work normally. 5. The reason of unit test of runTestRestartLogAppender can pass is that it did not stop the old GrpcLogAppender, and the old GrpcLogAppender append entry to follower, not the new GrpcLogAppender. If stop the old GrpcLogAppender, runTestRestartLogAppender will fail. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Description:

*What's the problem?*
After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]:
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart of GrpcLogAppender.
2. In fact, when a group is removed, GrpcLogAppender is stopped first: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. Then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, the GrpcLogAppender is stopped asynchronously. The infinite restart therefore happens when the GrpcLogAppender stops after the SegmentedRaftLog has been closed.
!screenshot-1.png!

*Why did GrpcLogAppender not stop the daemon thread when removed from senders?*
{color:#DE350B}Still working on it. The previous patch has some problems, and I will submit it again.{color}

*Can the new GrpcLogAppender work normally?*
1. Even without the problem above, the newly created GrpcLogAppender still cannot work normally.
2. The newly created GrpcLogAppender appends an entry to the follower, and the follower responds SUCCESS.
3. Then, in LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], an error occurs, because voterLists.get(0) returns the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129].
4. Because the majority commit is computed from the FollowerInfo of the old GrpcLogAppender, it never changes. So even though the follower has appended the entry successfully, the leader cannot advance the commit index, and the newly created GrpcLogAppender can never work normally.
5. The unit test runTestRestartLogAppender passes only because it does not stop the old GrpcLogAppender, so the old GrpcLogAppender, not the new one, appends entries to the follower. If the old GrpcLogAppender is stopped, runTestRestartLogAppender fails.
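The stale-FollowerInfo effect reported in this description can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyFollowerInfo` and `majorityMin` are hypothetical stand-ins written for this example, not the Ratis `FollowerInfo` or `LeaderState::getMajorityMin`, and the real computation may differ in detail. The point it shows is that progress recorded on an object the voter list does not hold can never move the majority value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: these are not the Ratis classes, and every name here is hypothetical. */
class StaleFollowerInfoSketch {
  /** Stand-in for the per-follower replication progress kept by the leader. */
  static class ToyFollowerInfo {
    private long matchIndex;

    ToyFollowerInfo(long matchIndex) {
      this.matchIndex = matchIndex;
    }

    long getMatchIndex() {
      return matchIndex;
    }

    void updateMatchIndex(long newIndex) {
      matchIndex = Math.max(matchIndex, newIndex);
    }
  }

  /** Largest index that a majority of the group (registered voters plus the leader) has reached. */
  static long majorityMin(List<ToyFollowerInfo> voters, long leaderIndex) {
    long[] indices = new long[voters.size() + 1];
    for (int i = 0; i < voters.size(); i++) {
      indices[i] = voters.get(i).getMatchIndex();
    }
    indices[voters.size()] = leaderIndex;
    Arrays.sort(indices);
    return indices[(indices.length - 1) / 2];
  }

  public static void main(String[] args) {
    // The voter list holds the progress objects of the old appenders.
    List<ToyFollowerInfo> voters = new ArrayList<>();
    voters.add(new ToyFollowerInfo(5));
    voters.add(new ToyFollowerInfo(5));

    // Restarted appenders allocate fresh objects that the voter list never sees.
    ToyFollowerInfo fresh1 = new ToyFollowerInfo(5);
    ToyFollowerInfo fresh2 = new ToyFollowerInfo(5);
    fresh1.updateMatchIndex(10);
    fresh2.updateMatchIndex(10);
    System.out.println(majorityMin(voters, 10)); // prints 5: the commit cannot advance

    // If a restarted appender reused the registered object, progress would be visible.
    voters.get(0).updateMatchIndex(10);
    System.out.println(majorityMin(voters, 10)); // prints 10
  }
}
```

Reusing the registered progress object on restart is only one possible way to keep the two views consistent; the report itself does not say which approach the eventual patch takes.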
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders ?* {color:#DE350B}Still working. The previous patch has some problem, and I will submit it again.{color} {color:#DE350B}Can the new GrpcLogAppender work normally ?{color} 1. Even though without the above problem, the new created GrpcLogAppender still can not work normally. 2. When the new created GrpcLogAppender append entry to follower, then the follower response SUCCESS. 3. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] -> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607], error happens. Because voterLists.get(0) return the FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. 
The new GrpcLogAppender created a new FollowerInfo: LeaderState::addAndStartSenders -> LeaderState::addSenders->RaftServerImpl::newLogAppender->[new FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] 4. Because the majority commit got from the FollowerInfo of the old GrpcLogAppender never changes. So even though follower has append entry successfully, the leader can not update commit. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. 
As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> RoleInfo::shutdownLeaderState -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> ServerState:close ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> RoleInfo::shutdownLeaderState -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> ServerState:close ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So infinite restart of GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close. !screenshot-1.png! was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> RoleInfo::shutdownLeaderState -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> ServerState:close ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So sometimes, GrpcLogAppender stop after SegmentedRaftLog close, then infinite restart of GrpcLogAppender happens. !screenshot-1.png! > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments:
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. 2. 
Actually, when remove group, GrpcLogAppender will be stoped: RaftServerImpl::shutdown -> RoleInfo::shutdownLeaderState -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will be closed: RaftServerImpl::shutdown -> ServerState:close ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, but the GrpcLogAppender was stopped asynchronously. So sometimes, GrpcLogAppender stop after SegmentedRaftLog close, then infinite restart of GrpcLogAppender happens. !screenshot-1.png! was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. !screenshot-1.png! > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Attachment: (was: RATIS-840.001.patch)

> Memory leak of LogAppender
> --------------------------
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Major
> Attachments: image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
> *What's the problem?*
> After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]:
> {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
> So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog is closed. The GrpcLogAppender is then [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throws the same exception when it finds the closed SegmentedRaftLog, so it is restarted again, and so on. The result is an infinite restart of GrpcLogAppender.
> !screenshot-1.png!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
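The restart loop described in this message can be reproduced with a toy model. This is a minimal sketch, assuming hypothetical names throughout (`ToyLog`, `countRestarts`, and the `skipRestartWhenClosed` guard are inventions for this example and are not part of Ratis). It shows that once the log is closed, every replacement appender fails in the same way, so the number of restarts is limited only by an artificial cap, and that checking the log state before restarting ends the loop at once.

```java
/** Toy model only: these are not the Ratis classes, and every name here is hypothetical. */
class RestartLoopSketch {
  /** Stand-in for a log that rejects reads once it has been closed. */
  static class ToyLog {
    private volatile boolean closed;

    void close() {
      closed = true;
    }

    boolean isClosed() {
      return closed;
    }

    void read() {
      if (closed) {
        throw new IllegalStateException("log is closed");
      }
    }
  }

  /**
   * Simulates a leader that replaces a failed appender with a new one.
   * Returns the number of restarts, capped at maxRestarts so that the demo terminates.
   */
  static int countRestarts(ToyLog log, boolean skipRestartWhenClosed, int maxRestarts) {
    int restarts = 0;
    while (restarts < maxRestarts) {
      try {
        log.read();      // the current appender tries to read entries to send
        return restarts; // it works, so no further restart is needed
      } catch (IllegalStateException e) {
        if (skipRestartWhenClosed && log.isClosed()) {
          return restarts; // assumed guard: a closed log means shutdown, so give up
        }
        restarts++;        // replace the failed appender with a new one
      }
    }
    return restarts;
  }

  public static void main(String[] args) {
    ToyLog log = new ToyLog();
    log.close();
    System.out.println(countRestarts(log, false, 1000)); // prints 1000: only the cap stops it
    System.out.println(countRestarts(log, true, 1000));  // prints 0: the guard stops it
  }
}
```

The guard is an illustration of the failure mode, not a statement about how the issue was resolved in Ratis.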
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* 1. As the image shows, when remove group, SegmentedRaftLog will close, then GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then GrpcLogAppender will be [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], and the new GrpcLogAppender throw exception again when find the SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . It results in an infinite restart of GrpcLogAppender. !screenshot-1.png! was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. 
But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* !screenshot-1.png! > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. 
> {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *Why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times ?* > 1. As the image shows, when remove group, SegmentedRaftLog will close, then > GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. > Then GrpcLogAppender will be >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-840:
Attachment: screenshot-1.png

> Memory leak of LogAppender
> --------------------------
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Major
> Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
> *What's the problem?*
> After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2 instances of GrpcLogAppender. There were also many log entries related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]:
> {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
> So many GrpcLogAppender instances did not stop their daemon thread when they were removed from senders.
> !image-2020-04-06-14-27-28-485.png!
> !image-2020-04-06-14-27-39-582.png!
>
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times?*
>
> *What's the reason?*
> From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] is called in [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], it does not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164].
>
> *How to fix?*
> To avoid forgetting stopAppender, I call stopAppender in [SenderList::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173].
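The idea in the quoted "How to fix" section, stopping an appender at the point where it is removed from the sender list, can be sketched as follows. This is a toy model under stated assumptions: `ToyAppender` and `ToySenderList` are hypothetical classes written for this example, and the methods `removeOnly` and `removeAndStop` are illustrative names, not the Ratis `SenderList` API.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Toy model only: these are not the Ratis classes, and every name here is hypothetical. */
class SenderListSketch {
  /** Stand-in for an appender that owns a background thread, modelled by a flag. */
  static class ToyAppender {
    private volatile boolean running = true;

    void stopAppender() {
      running = false;
    }

    boolean isRunning() {
      return running;
    }
  }

  /** Stand-in for the leader's list of active appenders. */
  static class ToySenderList {
    private final List<ToyAppender> senders = new CopyOnWriteArrayList<>();

    void add(ToyAppender appender) {
      senders.add(appender);
    }

    int size() {
      return senders.size();
    }

    /** Removes the appenders but leaves them running, which is the leak in the report. */
    boolean removeOnly(Collection<ToyAppender> toRemove) {
      return senders.removeAll(toRemove);
    }

    /** Stops every appender that is actually removed, so the caller cannot forget to. */
    boolean removeAndStop(Collection<ToyAppender> toRemove) {
      boolean changed = false;
      for (ToyAppender appender : toRemove) {
        if (senders.remove(appender)) {
          appender.stopAppender();
          changed = true;
        }
      }
      return changed;
    }
  }

  public static void main(String[] args) {
    ToySenderList senders = new ToySenderList();
    ToyAppender leaked = new ToyAppender();
    ToyAppender stopped = new ToyAppender();
    senders.add(leaked);
    senders.add(stopped);

    senders.removeOnly(Collections.singletonList(leaked));
    senders.removeAndStop(Collections.singletonList(stopped));
    System.out.println(leaked.isRunning());  // prints true: removed but still running
    System.out.println(stopped.isRunning()); // prints false
    System.out.println(senders.size());      // prints 0
  }
}
```

Putting the stop call inside the removal method means every removal path gets it automatically, which is the motivation the quoted text gives for placing it in the list rather than in each caller.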
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* !screenshot-1.png! was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. 
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?* *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. 
But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *Why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times ?* > !screenshot-1.png! -- This
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description:
*What's the problem ?*
When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
{code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders.
!image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png!

*Why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times ?*

*What's the reason ?*
From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] in [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164].
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].*{color:#DE350B} I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color}* {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].*{color:#DE350B} I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color}* {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].*{color:#DE350B} I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color}* {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated RATIS-840: -- Component/s: server > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].*{color:#DE350B} > I will continue to find the root cause of why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times{color}* > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! 
>
> *What's the reason?*
> From the code, when [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] calls [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it does not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164].
>
> *How to fix?*
> To avoid forgetting to call stopAppender, I call stopAppender in [SenderList::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173].
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].*{color:#DE350B} I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color}* {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].{color:#DE350B}I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color} {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].{color:#DE350B}I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times{color} {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]: {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}. I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]: {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}. I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. 
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]: {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* >From the code, when >[removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in >[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call >[LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. 
Besides, I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to >
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. And there are a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]: {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] in [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173].
Besides, I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] in [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173].
Besides, I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]: > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png!
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] in [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. Besides, I will continue to find the root cause of why [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png!
!image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. Besides, I will continue to find the root cause of why [restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the > Daemon Thread when removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png!
> > *What's the reason ?* > From the code, when > [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431] > in > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428], > it did not call > [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. > > *How to fix ?* > To avoid forgetting stopAppender, I stopAppender in [SenderList > ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Besides, I will continue to find the root cause of why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. Besides, I will continue to find the root cause of why [restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] so many times was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png!
*What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the > Daemon Thread when removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *What's the reason ?* > From the code, when > [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], > it did not call > [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. > > *How to fix ?* > To avoid forgetting stopAppender, I stopAppender in [SenderList > ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173].
> Besides, I will continue to find the root cause of why > [restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Description: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it did not call [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. *How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. was: *What's the problem ?* When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I found there are 460710 instances of GrpcLogAppender. But there are only 6 instances of SenderList, and each SenderList contains 1-2 instance of GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the Daemon Thread when removed from senders. !image-2020-04-06-14-27-28-485.png! !image-2020-04-06-14-27-39-582.png! *What's the reason ?* From the code, when [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], it did not call LogAppender::stopAppender.
*How to fix ?* To avoid forgetting stopAppender, I stopAppender in [SenderList ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the > Daemon Thread when removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *What's the reason ?* > From the code, when > [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], > it did not call > [LogAppender::stopAppender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L164]. > > *How to fix ?* > To avoid forgetting stopAppender, I stopAppender in [SenderList > ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-840: - Attachment: RATIS-840.001.patch > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Priority: Major > Attachments: RATIS-840.001.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. So there are a lot of GrpcLogAppender did not stop the > Daemon Thread when removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *What's the reason ?* > From the code, when > [removeSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L431], > it did not call LogAppender::stopAppender. > > *How to fix ?* > To avoid forgetting stopAppender, I stopAppender in [SenderList > ::removeAll|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L173]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
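The fix described in the issue (stop each appender at the point where it is removed from the sender list, so no caller can forget) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Ratis code or the attached patch: `SenderListSketch`, `Appender`, `stoppedWithin`, and the polling loop are hypothetical stand-ins, and only the names `SenderList`, `removeAll`, and `stopAppender` come from the issue text.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

public class SenderListSketch {

  /** Hypothetical stand-in for a LogAppender that owns a daemon thread. */
  static class Appender {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final Thread daemon;

    Appender(String name) {
      daemon = new Thread(() -> {
        // Placeholder for the real append loop: idle until asked to stop.
        while (running.get()) {
          try {
            Thread.sleep(10);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }, name);
      daemon.setDaemon(true);
      daemon.start();
    }

    /** Signals the daemon thread to exit and wakes it if it is sleeping. */
    void stopAppender() {
      running.set(false);
      daemon.interrupt();
    }

    boolean isAlive() {
      return daemon.isAlive();
    }

    /** Waits up to the given time and reports whether the daemon thread has exited. */
    boolean stoppedWithin(long millis) {
      try {
        daemon.join(millis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return !daemon.isAlive();
    }
  }

  /** Hypothetical sender list whose removeAll stops every appender it actually removes. */
  static class SenderList {
    private final List<Appender> senders = new CopyOnWriteArrayList<>();

    void add(Appender appender) {
      senders.add(appender);
    }

    int size() {
      return senders.size();
    }

    boolean removeAll(Collection<Appender> toRemove) {
      boolean changed = false;
      for (Appender appender : toRemove) {
        // Stopping here, instead of in each caller, means a restart path
        // cannot drop an appender while leaving its thread running.
        if (senders.remove(appender)) {
          appender.stopAppender();
          changed = true;
        }
      }
      return changed;
    }
  }

  public static void main(String[] args) {
    SenderList list = new SenderList();
    Appender appender = new Appender("appender-demo");
    list.add(appender);
    list.removeAll(Collections.singletonList(appender));
    System.out.println("stopped=" + appender.stoppedWithin(1000) + ", size=" + list.size());
  }
}
```

Tying the stop to the removal itself keeps the cleanup in one place; whether the real patch stops appenders that were not actually in the list, or how it interacts with the restart path, is not stated in the messages above.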