Daryn Sharp created HDFS-8776:
---------------------------------

             Summary: Decom manager should not be active on standby
                 Key: HDFS-8776
                 URL: https://issues.apache.org/jira/browse/HDFS-8776
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
            Reporter: Daryn Sharp
            Assignee: Daryn Sharp

The decommission manager should not be actively processing on the standby. On the standby, the decommission manager still goes through the costly computation of determining whether each block on the node requires replication, yet it never queues those blocks for replication, because the NN is in standby.

While it does this, the decommission manager holds the namesystem write lock, which produces a repeating cycle: DNs time out on heartbeats or IBRs, the NN purges the call queue of timed-out clients, and the NN processes some heartbeats/IBRs before the decommission manager locks up the namesystem again. Nodes attempting to register will send full BRs, which are more costly to send and discard than a heartbeat.

If a failover is required, the standby will likely have to struggle very hard to avoid a GC while "catching up" on its queued IBRs, as DNs continue to fill the call queue and time out.
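
To illustrate the behavior requested above, here is a minimal sketch of a periodic decommission monitor that skips its whole tick when the NN is not active, so it neither takes the write lock nor scans blocks on the standby. This is not the actual Hadoop code or the eventual patch: `HaState`, `NamesystemSketch`, `DecomMonitorSketch`, and the counters are hypothetical names invented for this example, and the per-node loop is a placeholder for the real block scan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for illustration; none of these are real Hadoop classes.
enum HaState { ACTIVE, STANDBY }

class NamesystemSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    volatile HaState state = HaState.STANDBY;
    int writeLockAcquisitions = 0; // how many times the monitor took the write lock
}

class DecomMonitorSketch implements Runnable {
    private final NamesystemSketch ns;
    private final List<String> decommissioningNodes;
    int scans = 0; // how many node scans were performed

    DecomMonitorSketch(NamesystemSketch ns, List<String> decommissioningNodes) {
        this.ns = ns;
        this.decommissioningNodes = decommissioningNodes;
    }

    @Override
    public void run() {
        // On standby, return before touching the lock: no lock contention, no scan.
        if (ns.state != HaState.ACTIVE) {
            return;
        }
        ns.lock.writeLock().lock();
        try {
            ns.writeLockAcquisitions++;
            // Re-check under the lock in case the state changed while waiting for it.
            if (ns.state != HaState.ACTIVE) {
                return;
            }
            for (String node : decommissioningNodes) {
                scans++; // placeholder for the costly per-block replication check
            }
        } finally {
            ns.lock.writeLock().unlock();
        }
    }
}

class DecomSketchDemo {
    public static void main(String[] args) {
        NamesystemSketch ns = new NamesystemSketch();
        DecomMonitorSketch monitor =
            new DecomMonitorSketch(ns, Arrays.asList("dn1", "dn2"));

        monitor.run(); // standby: tick is a no-op
        System.out.println("standby scans=" + monitor.scans
            + " locks=" + ns.writeLockAcquisitions);

        ns.state = HaState.ACTIVE; // simulated failover
        monitor.run(); // active: tick scans both nodes
        System.out.println("active scans=" + monitor.scans
            + " locks=" + ns.writeLockAcquisitions);
    }
}
```

The state check is done once before locking, so a standby never contends for the lock, and once more under the lock so a tick that raced with a state transition does no work. Whether the real fix should pause the monitor thread entirely or only short-circuit each tick is a design choice this sketch does not settle.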