[ https://issues.apache.org/jira/browse/HDFS-9305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated HDFS-9305:
-----------------------------
    Fix Version/s: 2.8.0

> Delayed heartbeat processing causes storm of subsequent heartbeats
> ------------------------------------------------------------------
>
>                 Key: HDFS-9305
>                 URL: https://issues.apache.org/jira/browse/HDFS-9305
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1
>            Reporter: Chris Nauroth
>            Assignee: Arpit Agarwal
>             Fix For: 2.8.0, 2.7.2, 3.0.0-alpha1
>
>         Attachments: HDFS-9305.01.patch, HDFS-9305.02.patch
>
>
> A DataNode typically sends a heartbeat to the NameNode every 3 seconds. We
> expect heartbeat handling to complete relatively quickly. However, if
> something unexpected causes heartbeat processing to get blocked, such as a
> long GC or heavy lock contention within the NameNode, then heartbeat
> processing would be delayed. After recovering from this delay, the DataNode
> then starts sending a storm of heartbeat messages in a tight loop. In a
> large cluster with many DataNodes, this storm of heartbeat messages could
> cause harmful load on the NameNode and make overall cluster recovery more
> difficult.
> The bug appears to be caused by incorrect timekeeping inside
> {{BPServiceActor}}. The next heartbeat time is always calculated as a delta
> from the previous heartbeat time, without any compensation for possible long
> latency on an individual heartbeat RPC. The only mitigation would be
> restarting all DataNodes to force a reset of the heartbeat schedule, or
> simply waiting out the storm until the scheduling catches up and corrects
> itself.
> This problem would not manifest after a NameNode restart. In that case, the
> NameNode would respond to the first heartbeat by telling the DataNode to
> re-register, and {{BPServiceActor#reRegister}} would reset the heartbeat
> schedule to the current time. I believe the problem would only manifest if
> the NameNode process stayed alive, but processed heartbeats unexpectedly
> slowly.
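
The timekeeping problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is not the {{BPServiceActor}} code and not necessarily what the attached patches do. The class name, the method names, and the 60-second stall and 10 ms RPC latency are all hypothetical; only the 3-second interval comes from the description. It compares scheduling the next heartbeat as a delta from the previous scheduled time with scheduling it from the time the heartbeat actually completed.

```java
public class HeartbeatScheduleSketch {
    // Heartbeat interval from the issue description (3 seconds).
    static final long INTERVAL_MS = 3_000;

    /** Next heartbeat as a delta from the previous scheduled time (the behaviour the report describes). */
    static long nextFromPrevious(long previousScheduledMs) {
        return previousScheduledMs + INTERVAL_MS;
    }

    /** Next heartbeat as a delta from the current time (one possible compensation, assumed here). */
    static long nextFromNow(long nowMs) {
        return nowMs + INTERVAL_MS;
    }

    /**
     * Simulates one heartbeat that stalls for stallMs, followed by heartbeats
     * that each take rpcMs, and counts how many of the later heartbeats are
     * already overdue when the previous one returns, so they are sent with no wait.
     */
    static int immediateHeartbeatsAfterStall(long stallMs, long rpcMs, boolean compensate) {
        long now = 0;        // simulated monotonic clock, in milliseconds
        long scheduled = 0;  // the first heartbeat is due at t = 0

        // The stalled heartbeat: the RPC returns only after stallMs.
        now += stallMs;
        scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);

        int immediate = 0;
        while (scheduled <= now) {  // schedule is in the past: send right away
            immediate++;
            now += rpcMs;
            scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);
        }
        return immediate;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: a 60 s stall, then 10 ms per heartbeat RPC.
        System.out.println("delta from previous: "
                + immediateHeartbeatsAfterStall(60_000, 10, false));
        System.out.println("delta from now:      "
                + immediateHeartbeatsAfterStall(60_000, 10, true));
    }
}
```

In this hypothetical scenario the uncompensated schedule sends 20 heartbeats back to back from a single DataNode before it catches up, while the schedule computed from the current time sends none. Multiplied across a large cluster, that difference is the storm the report describes.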