[ https://issues.apache.org/jira/browse/HBASE-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang reopened HBASE-24813: ------------------------------- This cause very serious problem since the ReplicationSource thread can not quit and then we flood the our log in UT with {noformat} Interrupting source thread for peer xxx without cleaning buffer usage {noformat} And then blow up the /tmp spaces, as one single std*deferred file(which is used to buffer stdout and stderr for surefire xml report) can be several tens of GBs. This is a very critical problem that cause the jenkins node unavailable so let me revert it first. Please consider fixing the problem before committing again. Thanks a lot. > ReplicationSource should clear buffer usage on ReplicationSourceManager upon > termination > ---------------------------------------------------------------------------------------- > > Key: HBASE-24813 > URL: https://issues.apache.org/jira/browse/HBASE-24813 > Project: HBase > Issue Type: Bug > Components: Replication > Affects Versions: 3.0.0-alpha-1, 2.3.1, 2.2.6 > Reporter: Wellington Chevreuil > Assignee: Wellington Chevreuil > Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.3, 2.4.0, 2.2.7 > > > Following investigations on the issue described by [~elserj] on HBASE-24779, > we found out that once a peer is removed, thus killing peers related > *ReplicationSource* instance, it may leave > *ReplicationSourceManager.totalBufferUsed* inconsistent. This can happen if > *ReplicationSourceWALReader* had put some entries on its queue to be > processed by *ReplicationSourceShipper,* but the peer removal killed the > shipper before it could process the pending entries. When > *ReplicationSourceWALReader* thread add entries to the queue, it increments > *ReplicationSourceManager.totalBufferUsed* with the sum of the entries sizes. > When those entries are read by *ReplicationSourceShipper,* > *ReplicationSourceManager.totalBufferUsed* is then decreased. We should also > decrease *ReplicationSourceManager.totalBufferUsed* when *ReplicationSource* > is terminated, otherwise those unprocessed entries size would be consuming > *ReplicationSourceManager.totalBufferUsed __*indefinitely, unless the RS gets > restarted. This may be a problem for deployments with multiple peers, or if > new peers are added.** -- This message was sent by Atlassian Jira (v8.3.4#803005)