[ https://issues.apache.org/jira/browse/HDFS-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated HDFS-4656: --------------------------------- Attachment: HDFS-4656.patch Trivial patch attached. To test this I added some extra logging both to the heartbeat loop itself and inside the {{if}} block mentioned in the issue description. Without the patch I saw output like this: {noformat} 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:37 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:20:37 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_2854604910635513445_147059 13/04/01 17:20:38 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-7126490035827800141_147054 13/04/01 17:20:39 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-7998009805522854247_147131 13/04/01 17:20:40 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:40 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:40 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:40 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:20:40 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-7289451447776161689_147106 13/04/01 17:20:41 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-3895917754682539115_147098 13/04/01 17:20:42 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_1182110057394126316_147123 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:20:43 INFO datanode.DataNode: ====> actually performing heartbeat {noformat} With the patch I see output like this: {noformat} 13/04/01 17:22:45 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:22:45 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:22:46 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-5664468450728548846_148227 13/04/01 17:22:47 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_5505984176030032691_148190 13/04/01 17:22:48 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-847450075981641452_148208 13/04/01 17:22:48 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:22:48 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:22:49 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-5755422789655200141_148209 13/04/01 17:22:50 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-4066581005605116272_148148 13/04/01 17:22:51 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-3643658690040746645_148232 13/04/01 17:22:51 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:22:51 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:22:52 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_7451729659994128063_148176 13/04/01 17:22:53 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_7633597047916157767_148158 13/04/01 17:22:54 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_8386436294983826146_148194 13/04/01 17:22:54 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:22:54 INFO datanode.DataNode: ====> actually performing heartbeat 13/04/01 17:22:55 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-4489059851090749136_148224 13/04/01 17:22:56 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-5331285809567165401_148151 13/04/01 17:22:57 INFO datanode.BlockPoolSliceScanner: Verification succeeded for BP-193129671-172.21.3.106-1357591728674:blk_-4611051254925118595_148196 13/04/01 17:22:57 INFO datanode.DataNode: ====> heartbeat loop... 13/04/01 17:22:57 INFO datanode.DataNode: ====> actually performing heartbeat {noformat} No tests are included since the patch is trivial and would be difficult to test in a non-convoluted manner. > DN heartbeat loop can be briefly tight > -------------------------------------- > > Key: HDFS-4656 > URL: https://issues.apache.org/jira/browse/HDFS-4656 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Affects Versions: 2.0.4-alpha > Reporter: Aaron T. Myers > Assignee: Aaron T. Myers > Priority: Minor > Attachments: HDFS-4656.patch > > > The DN hearbeat loop looks roughly like this: > {code} > if (now - timeOfLastHeartbeat > configuredHeartbeatInterval) { > // do heartbeat > } > timeToWait = configuredHeartbeatInterval - (now - timeOfLastHeartbeat) > sleep(timeToWait) > {code} > The trouble is that since we sleep for exactly the heartbeat interval, and > then check to see if we have waited _more_ than that heartbeat interval, we > will very often have waited exactly the heartbeat interval (in millis), and > not more than it. In this case we will skip actually performing the heartbeat > and will calculcate timeToWait as being 0ms. The DN heartbeat loop will then > loop tightly for 1ms. The solution is just to change the "{{>}}" in the code > above to "{{>=}}". -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira