[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844457#comment-16844457 ]

Feng Yuan commented on HDFS-9239:
---------------------------------

Hi [~cnauroth], there is still a blocking issue. org.apache.hadoop.hdfs.server.datanode.BPServiceActor.LifelineSender#sendLifeline calls bpos.getBlockPoolId, which takes the readLock, but processcommand holds the writeLock, so sending a lifeline can still block behind command processing.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode
> liveness
> ---------------------------------------------------------------------------
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch,
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline
> Protocol. This is an RPC protocol that is responsible for reporting liveness
> and basic health information about a DataNode to a NameNode. Compared to the
> existing heartbeat messages, it is lightweight and not prone to resource
> contention problems that can harm accurate tracking of DataNode liveness
> currently. The attached design document contains more details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
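For readers unfamiliar with read/write lock contention, the situation reported in this comment can be illustrated with a minimal sketch. It is not Hadoop code: the class name LifelineLockSketch, the method name, and the use of a plain ReentrantReadWriteLock are assumptions made for illustration. One thread holds the write lock (standing in for command processing) while another tries to take the read lock (standing in for the lifeline sender).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the contention described in the comment; not Hadoop code.
class LifelineLockSketch {

  /**
   * Tries to take the read lock (as a lifeline sender would) while another
   * thread holds the write lock (as command processing would).
   * Returns true only if the read lock was acquired within timeoutMs.
   */
  static boolean readLockAcquiredWhileWriteHeld(long timeoutMs) {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch writeHeld = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    Thread commandProcessor = new Thread(() -> {
      lock.writeLock().lock();
      try {
        writeHeld.countDown();
        release.await(); // simulates a long-running command
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.writeLock().unlock();
      }
    });
    commandProcessor.start();

    try {
      writeHeld.await(); // wait until the write lock is really held
      boolean acquired = lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
      if (acquired) {
        lock.readLock().unlock();
      }
      return acquired;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown(); // let the writer finish
      try {
        commandProcessor.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("read lock acquired: " + readLockAcquiredWhileWriteHeld(200));
  }
}
```

In this sketch the reader gives up after a timeout; a plain lock() call would instead wait for as long as the writer holds the lock, which is the kind of delay a liveness signal is meant to avoid.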
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218470#comment-15218470 ]

Nathan Roberts commented on HDFS-9239:
--------------------------------------

bq. Just to make sure I'm clear, are you talking about configuring the deadline scheduler as described here?

Yes, those links are talking about the right parameters. We currently run with read_expire=1000, write_expire=1000, and writes_starved=1. Since our I/O workloads change dramatically over time, we didn't spend a lot of time looking for optimal values here. These have been working well for the last several months across multiple clusters.

As an aside, a relatively easy way to reproduce this problem is to put a heavy seek load on all the disks of a datanode (e.g. http://www.linuxinsight.com/how_fast_is_your_disk.html; I believe 5-10 copies of seeker were sufficient). After a minute or so, the system becomes almost unusable and the datanode will be declared lost. This might be a good test to run against the lifeline protocol. My hunch is that, with CFQ, the datanode will still be lost.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218396#comment-15218396 ]

Chris Nauroth commented on HDFS-9239:
-------------------------------------

[~nroberts], fascinating! Thank you for sharing.

Just to make sure I'm clear, are you talking about configuring the deadline scheduler as described here?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Kernel_Boot_Parameters-The_IO_Scheduler.html

Also, did you find any relevant additional scheduler tuning configuration was needed, as described here?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/ch06s04s02.html
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218302#comment-15218302 ]

Nathan Roberts commented on HDFS-9239:
--------------------------------------

bq. However, making it lighter on the datanode side is a good idea. We have seen many cases where nodes are declared dead because the service actor thread is delayed/blocked.

Just a quick update on this comment. Even after HDFS-7060 we still had cases where Datanodes would fail to heartbeat in. We eventually tracked this down to the RHEL CFQ I/O scheduler. There are situations where significant seek activity (like a massive shuffle) can cause this I/O scheduler to indefinitely starve writers. This eventually causes the datanode and/or nodemanager processes to completely stop (probably due to logging I/O backing up). So, no matter how smart we make these daemons, they are going to be lost from the NN/RM point of view in these situations. But this is probably the right outcome in these cases: these daemons are clearly not able to do their job, so they SHOULD be declared lost.

In any event, the change which we found most valuable for this situation was to use the deadline I/O scheduler. This dramatically reduced the number of lost datanodes and nodemanagers we were seeing.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218270#comment-15218270 ]

Chris Nauroth commented on HDFS-9239:
-------------------------------------

I also am hesitant to backport a patch of this size right now, but I might reconsider after getting more production experience with the feature.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217719#comment-15217719 ]

Steve Loughran commented on HDFS-9239:
--------------------------------------

I'm nervous about pushing stuff back. At the very least, ship Hadoop 2.8 and see if things break
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217710#comment-15217710 ]

Vinayakumar B commented on HDFS-9239:
-------------------------------------

Even though this Jira, along with HDFS-9311, is a New Feature, IMO it would be worth merging to branch-2.7. A lot of deployments would benefit.

What do you think, [~kihwal]? If that is okay, I would be happy to post a patch for branch-2.7 if the available patches do not apply on branch-2.7.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181353#comment-15181353 ]

Hudson commented on HDFS-9239:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #9426 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9426/])
HDFS-9239. DataNode Lifeline Protocol: an alternative protocol for (cnauroth: rev 2759689d7d23001f007cb0dbe2521de90734dd5c)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolPB.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeRegister.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-common-project/hadoop-common/src/site/markdown/Metrics.md
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeLifelineProtocol.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/pom.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBPOfferService.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockPoolManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeLifeline.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/NamenodeProtocols.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolServerSideTranslatorPB.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeLifelineProtocol.proto
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolClientSideTranslatorPB.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180528#comment-15180528 ]

Tsz Wo Nicholas Sze commented on HDFS-9239:
-------------------------------------------

+1 the new patch looks good. Thanks for the update.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180374#comment-15180374 ]

Chris Nauroth commented on HDFS-9239:
-------------------------------------

The test failures are unrelated. The remaining style warnings are not worth addressing.

[~szetszwo], would you please take a look at patch v003 and my comments that go with it? Thank you.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180227#comment-15180227 ]

Kihwal Lee commented on HDFS-9239:
----------------------------------

Filed HDFS-9905 for TestWebHdfsTimeouts.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179493#comment-15179493 ]

Hadoop QA commented on HDFS-9239:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 25s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 5 new or modified test files. |
| 0 | mvndep | 0m 31s | Maven dependency ordering for branch |
| +1 | mvninstall | 8m 58s | trunk passed |
| +1 | compile | 12m 32s | trunk passed with JDK v1.8.0_74 |
| +1 | compile | 10m 15s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 1m 41s | trunk passed |
| +1 | mvnsite | 2m 44s | trunk passed |
| +1 | mvneclipse | 0m 37s | trunk passed |
| +1 | findbugs | 4m 38s | trunk passed |
| +1 | javadoc | 3m 0s | trunk passed with JDK v1.8.0_74 |
| +1 | javadoc | 3m 45s | trunk passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 20s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 59s | the patch passed |
| +1 | compile | 12m 50s | the patch passed with JDK v1.8.0_74 |
| -1 | cc | 15m 10s | root-jdk1.8.0_74 with JDK v1.8.0_74 generated 1 new + 9 unchanged - 1 fixed = 10 total (was 10) |
| +1 | cc | 12m 50s | the patch passed |
| +1 | javac | 12m 50s | the patch passed |
| +1 | compile | 10m 11s | the patch passed with JDK v1.7.0_95 |
| +1 | cc | 10m 11s | the patch passed |
| +1 | javac | 10m 11s | the patch passed |
| -1 | checkstyle | 1m 46s | root: patch generated 7 new + 1053 unchanged - 6 fixed = 1060 total (was 1059) |
| +1 | mvnsite | 2m 44s | the patch passed |
| +1 | mvneclipse | 0m 42s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 5m 33s | the patch passed |
| +1 | javadoc | 3m 1s | the patch passed with JDK v1.8.0_74 |
| +1 | javadoc | 3m 53s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 11m 14s | hadoop-common in the patch failed with JDK v1.8.0_74. |
| -1 | unit | 95m 57s | hadoop-hdfs in the patch failed with JDK v1.8.0_74. |
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177102#comment-15177102 ]

Hadoop QA commented on HDFS-9239:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 10s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 5 new or modified test files. |
| 0 | mvndep | 0m 14s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 49s | trunk passed |
| +1 | compile | 7m 36s | trunk passed with JDK v1.8.0_72 |
| +1 | compile | 6m 50s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 1m 25s | trunk passed |
| +1 | mvnsite | 2m 1s | trunk passed |
| +1 | mvneclipse | 0m 28s | trunk passed |
| +1 | findbugs | 3m 31s | trunk passed |
| +1 | javadoc | 1m 59s | trunk passed with JDK v1.8.0_72 |
| +1 | javadoc | 2m 53s | trunk passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 14s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 30s | the patch passed |
| +1 | compile | 7m 7s | the patch passed with JDK v1.8.0_72 |
| +1 | cc | 7m 7s | the patch passed |
| +1 | javac | 7m 7s | the patch passed |
| +1 | compile | 7m 19s | the patch passed with JDK v1.7.0_95 |
| +1 | cc | 7m 19s | the patch passed |
| +1 | javac | 7m 19s | the patch passed |
| -1 | checkstyle | 1m 19s | root: patch generated 8 new + 1053 unchanged - 6 fixed = 1061 total (was 1059) |
| +1 | mvnsite | 1m 56s | the patch passed |
| +1 | mvneclipse | 0m 27s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 3m 57s | the patch passed |
| +1 | javadoc | 2m 6s | the patch passed with JDK v1.8.0_72 |
| +1 | javadoc | 2m 58s | the patch passed with JDK v1.7.0_95 |
| +1 | unit | 7m 45s | hadoop-common in the patch passed with JDK v1.8.0_72. |
| -1 | unit | 75m 13s | hadoop-hdfs in the patch failed with JDK v1.8.0_72. |
| +1 | unit | 8m 3s | hadoop-common in the patch passed with JDK v1.7.0_95. |
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176961#comment-15176961 ]

Tsz Wo Nicholas Sze commented on HDFS-9239:
-------------------------------------------

Should the try-catch be restructured like below?
{code}
@Override
public void run() {
  try {
    initialRegistrationComplete.await();
    while (shouldRun()) {
      try {
        if (lifelineNamenode == null) {
          lifelineNamenode = dn.connectToLifelineNN(lifelineNnAddr);
        }
        sendLifelineIfDue();
      } catch (IOException e) {
        LOG.warn("IOException in LifelineSender for " + BPServiceActor.this, e);
      }
      Thread.sleep(scheduler.getLifelineWaitTime());
    }
  } catch (InterruptedException e) {
    LOG.warn("LifelineSender interrupted", e);
  }
  LOG.info("LifelineSender for " + BPServiceActor.this + " exiting.");
}
{code}
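The structure proposed in this comment (wait for registration once, handle IOException per iteration, let InterruptedException end the loop) can be tried outside Hadoop with a minimal sketch. The class LifelineLoopSketch, the LifelineAction interface, and the runBriefly helper are hypothetical names introduced here for illustration; the loop condition uses the thread's interrupt status as a stand-in for shouldRun().

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed loop shape; not the actual LifelineSender.
class LifelineLoopSketch implements Runnable {

  /** Stand-in for the work done on each iteration (assumed name). */
  interface LifelineAction {
    void send() throws IOException;
  }

  private final CountDownLatch registered;
  private final LifelineAction action;
  private final long waitMs;
  final AtomicInteger attempts = new AtomicInteger();

  LifelineLoopSketch(CountDownLatch registered, LifelineAction action, long waitMs) {
    this.registered = registered;
    this.action = action;
    this.waitMs = waitMs;
  }

  @Override
  public void run() {
    try {
      registered.await(); // wait once, outside the loop
      while (!Thread.currentThread().isInterrupted()) {
        try {
          attempts.incrementAndGet();
          action.send();
        } catch (IOException e) {
          // A failed send is tolerated; the loop tries again after the wait.
        }
        Thread.sleep(waitMs);
      }
    } catch (InterruptedException e) {
      // Interruption ends the loop instead of being swallowed per iteration.
    }
  }

  /** Runs the loop with an action that always fails, then interrupts it. */
  static int runBriefly() {
    CountDownLatch registered = new CountDownLatch(1);
    LifelineLoopSketch loop = new LifelineLoopSketch(
        registered, () -> { throw new IOException("simulated failure"); }, 5);
    Thread t = new Thread(loop);
    t.start();
    registered.countDown();
    try {
      while (loop.attempts.get() < 3) {
        Thread.sleep(1);
      }
      t.interrupt();
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return loop.attempts.get();
  }

  public static void main(String[] args) {
    System.out.println("attempts before interrupt: " + runBriefly());
  }
}
```

The point of the shape is that a failing send does not stop the loop, while a single interrupt does.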
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176944#comment-15176944 ] Tsz Wo Nicholas Sze commented on HDFS-9239: ---
{code}
//LifelineSender
+public void run() {
+  while (shouldRun()) {
+    try {
+      initialRegistrationComplete.await();
+      break;
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+    }
+  }
{code}
If there is an InterruptedException, it should probably be rethrown as a RuntimeException. Calling Thread.currentThread().interrupt() just sets the thread's interrupt status, but the thread will continue running.
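The behavior described in this comment can be reproduced outside HDFS. The following is only a minimal standalone sketch, not code from any patch: the class {{InterruptRetryDemo}}, the {{shouldRun}} flag, and the helper latch are hypothetical stand-ins. It suggests that after one interrupt, restoring the interrupt status makes every later {{await()}} throw immediately, so the loop keeps spinning until the run flag is cleared.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class InterruptRetryDemo {
  /**
   * Runs the await-retry loop on a worker thread, interrupts it once, and
   * returns how many times await() threw InterruptedException.
   */
  static int countInterruptedAwaits() {
    // Stand-in for the registration latch; it is never counted down here.
    final CountDownLatch initialRegistrationComplete = new CountDownLatch(1);
    final CountDownLatch workerStarted = new CountDownLatch(1);
    // Hypothetical stand-in for shouldRun().
    final AtomicBoolean shouldRun = new AtomicBoolean(true);
    final AtomicInteger interruptedAwaits = new AtomicInteger();

    Thread worker = new Thread(() -> {
      workerStarted.countDown();
      while (shouldRun.get()) {
        try {
          initialRegistrationComplete.await();
          break;
        } catch (InterruptedException e) {
          interruptedAwaits.incrementAndGet();
          // Restores the interrupt status only; the loop keeps running.
          Thread.currentThread().interrupt();
        }
      }
    });
    worker.setDaemon(true);
    worker.start();
    try {
      workerStarted.await();
      worker.interrupt(); // a single interrupt from outside
      // A second interrupted await proves the restored flag re-triggers it.
      while (interruptedAwaits.get() < 2) {
        Thread.sleep(1);
      }
      shouldRun.set(false); // only this ends the loop
      worker.join();
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return interruptedAwaits.get();
  }

  public static void main(String[] args) {
    System.out.println("interrupted awaits: " + countInterruptedAwaits());
  }
}
```

In this sketch the loop terminates only because the flag is cleared, which is the shutdown ordering the patch author describes elsewhere in the thread. Until then the thread busy-spins, which is one argument for letting the exception end {{run()}} instead.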
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176926#comment-15176926 ] Tsz Wo Nicholas Sze commented on HDFS-9239: ---
This feature is going to be very useful for busy clusters. Some quick comments:
{code}
// DatanodeManager.handleLifeline
synchronized (heartbeatManager) {
  synchronized (datanodeMap) {
    DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
    ...
    heartbeatManager.updateLifeline(nodeinfo, reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary);
  }
{code}
- synchronized (datanodeMap) should be synchronized (this). We no longer synchronize on datanodeMap.
- Do we need synchronized (heartbeatManager)? heartbeatManager.updateLifeline is already synchronized.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118590#comment-15118590 ] Hadoop QA commented on HDFS-9239: -
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} | {color:red} HDFS-9239 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\ \\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12771665/HDFS-9239.001.patch |
| JIRA Issue | HDFS-9239 |
| Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/14257/console |
This message was automatically generated.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009762#comment-15009762 ] Chris Nauroth commented on HDFS-9239: - Thanks for the great reviews, everyone. I'll respond to some of the feedback now and update the patch later. bq. What you could do after the stat update is use tryLock for a short time. If you can't get the lock, oh well, this heartbeat response doesn't get any commands. That's an interesting idea. I'll explore this. bq. I'm not sure we need yet another RPC server for this purpose. Just to make sure everyone is aware, the additional optional RPC server is effectively already there due to committing HDFS-9311, which specifically targeted the related problem of ZKFC health check messages getting blocked. I agree that yet another RPC server is not ideal in terms of operational complexity, but I also saw it as the only viable option short-term achievable in the 2.x line. bq. In {{BPOfferServiceActor#run}} we retry await operation on being interrupted. My question is when would it be safe to retry ? In practice, I expect it never will retry. Suppose the thread enters {{await}} on the latch, but gets interrupted before the initial registration completes. The only thing that triggers thread interruption is shutdown of the whole DataNode (either the whole JVM process or a single DataNode inside a {{MiniDFSCluster}}). That means that by the time this thread gets interrupted, internal flags have already been updated such that the {{shouldRun()}} call on the next iteration will return {{false}}. {{run()}} would then return with no further action taken, and the thread would stop. However, there is also no harm done if the {{await}} gets retried. This is a daemon thread, so even if it keeps retrying, it won't block a normal JVM exit. bq. Just wondering if it makes sense to move synchronized(datanodeMap) into getDataNode. 
This might be a good idea, but I'd prefer to treat it as a separate code improvement decoupled from the work here. Right now, there are multiple points in the code that depend on specific lock ordering of {{heartbeatManager}} first followed by {{datanodeMap}} second to prevent deadlock. The current code makes this explicit with nested {{synchronized (X)}} blocks instead of implicit by declaring particular methods {{synchronized}}. Also, HDFS-8966 is making a lot of the changes in the locking here, so I expect making changes now would just create merge conflicts for that effort later. bq. Did you intend to call {{heartbeatManager.updateLifeline}} inside the {{synchronized(datanodeMap)}} or just inside {{synchronized (heartbeatManager)}}. Do we need to keep the lock on datanodeMap while updating stats ? This locking was intentional. If we do not hold the lock on {{datanodeMap}} during the get+update, then there is a risk that multiple heartbeats in flight for the same DataNode could cause a lost update, or even an inconsistency where the final state of the {{DatanodeDescriptor}} really contains some stats from one heartbeat and other stats from another heartbeat. I have not observed excessive lock contention here during the issues that prompted me to file this JIRA, so I didn't try hard to optimize this locking away. Some of the work in HDFS-8966 is likely to reduce the locking here anyway. bq. NN should still enforce a max number of skips and guarantee commands are sent in bounded time. Replication or block recovery is done through an asynchronous protocol, but oftentimes clients expect them to be done "soon". Are you saying that beyond some skipping threshold, the heartbeat should still be considered a failure, eventually causing the DataNode to be marked stale and then dead? I'm not sure how we'd set such a threshold, given that there is no SLA defined on these operations AFAIK. 
I'd say there is still value in keeping a DataNode alive in these cases, such as for serving reader activity. bq. It seems the introduction of a new RPC server is to work around the existing functionality of RPC which only support QoS based on user names. Yes, that's partially correct. I agree that introduction of another RPC server is in some sense a workaround. In fact, I would make the same argument for {{dfs.namenode.servicerpc-address}} in the first place too. Lack of sophisticated QoS drove us to isolate operations to a separate RPC server. ({{dfs.namenode.servicerpc-address}} also can have some side benefits in multi-homed environments that want to dedicate a whole separate network interface to certain operations.) I think a separate RPC server is a viable short-to-mid-term solution in the 2.x line. Longer term, I'd prefer that we evolve towards more sophisticated QoS that prioritizes critical "control plane" activity like heartbeats. However, this goes deeper than just QoS at the RPC layer, because we
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003219#comment-15003219 ] Ming Ma commented on HDFS-9239: ---
Sorry for jumping into the discussion late. While we haven't seen any recent issues caused by DNs incorrectly marked as dead, maybe this feature could mitigate the replication storm issue, where DNs incorrectly marked as dead cause even more replication?
* It seems the introduction of a new RPC server is to work around the existing functionality of RPC, which only supports QoS based on user names. Imagine if the RPC server could provide differentiated service based on method names; then we could just add {{sendLifeline}} to the existing {{DatanodeProtocol}} and have the same RPC server process the method call at the highest priority. Adding method-based RPC QoS could also help other use cases, for example, if we want to prioritize existing heartbeats over IBRs.
* Regarding the DN contention scenario which blocks it from sending {{sendLifeline}} to the NN, we could skip all info such as storage reports. But if the DN is already in such a state, maybe not sending {{sendLifeline}} is what we want anyway.
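The method-based QoS idea in this comment can be illustrated with a plain priority queue. This is only a sketch under stated assumptions, not how the Hadoop RPC server works: the class {{MethodPriorityCallQueue}} and the priority values are invented for illustration, and the method names are used purely as string labels.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class MethodPriorityCallQueue {
  /** A queued call: the RPC method name plus its arrival sequence number. */
  static final class Call {
    final String method;
    final long seq;

    Call(String method, long seq) {
      this.method = method;
      this.seq = seq;
    }
  }

  // Hypothetical priorities: a lower number is served first.
  private static final Map<String, Integer> PRIORITY = new HashMap<>();
  static {
    PRIORITY.put("sendLifeline", 0);
    PRIORITY.put("sendHeartbeat", 1);
    PRIORITY.put("blockReceivedAndDeleted", 2); // incremental block report
  }

  /** Methods without an explicit entry share the lowest priority. */
  static int priorityOf(String method) {
    Integer p = PRIORITY.get(method);
    return p == null ? 3 : p;
  }

  private final AtomicLong nextSeq = new AtomicLong();

  // Order by method priority first, then by arrival order within a priority.
  private final PriorityBlockingQueue<Call> queue = new PriorityBlockingQueue<>(
      16,
      Comparator.<Call>comparingInt(c -> priorityOf(c.method))
          .thenComparingLong(c -> c.seq));

  void offer(String method) {
    queue.offer(new Call(method, nextSeq.getAndIncrement()));
  }

  /** Returns the method name of the next call to serve, or null if empty. */
  String poll() {
    Call c = queue.poll();
    return c == null ? null : c.method;
  }
}
```

A strict priority order like this can starve low-priority calls under sustained load, so a real design would probably need some fairness mechanism on top.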
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002293#comment-15002293 ] Kihwal Lee commented on HDFS-9239: -- bq. NN heartbeat processing with a lockless + tryLock implementation would make it ideally suited for the existing client and/or service servers. NN should still enforce a max number of skips and guarantee commands are sent in bounded time. Replication or block recovery is done through an asynchronous protocol, but oftentimes clients expect them to be done "soon".
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001367#comment-15001367 ] Anu Engineer commented on HDFS-9239:
[~cnauroth] Without alluding to anything that [~daryn] brought up, thanks for the patch. It looks very good; some minor comments / questions below.
1. nit: {{BPofferService#Constructor}} Perhaps add a Precondition to make this relationship explicit, something like Preconditions.checkState(lifelineNnAddrs.size() == nnAddrs.size()), since we access both lists with the same index?
2. More of a question for my own understanding: in {{BPOfferServiceActor#run}} we retry the await operation on being interrupted. My question is, when would it be safe to retry? I wanted to understand if there are any scenarios where this can happen. Btw, I do see this as a good coding pattern.
{code}
while (shouldRun()) {
  try {
    initialRegistrationComplete.await();
    break;
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }
}
{code}
3. nit: DataNodeManager.java - Javadoc - xmitsInProgress replaced by maxTransfers
4. In DataNodeManager.java:
{code}
synchronized (datanodeMap) {
  DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
{code}
Two comments here:
a. Just wondering if it makes sense to move synchronized(datanodeMap) into getDataNode.
b. Did you intend to call {{heartbeatManager.updateLifeline}} inside the {{synchronized(datanodeMap)}} or just inside {{synchronized (heartbeatManager)}}? Do we need to keep the lock on datanodeMap while updating stats?
5. hdfs-default.xml: nit: Comment: since we rely on ratio as the default, you might want to fix the comment which says the default is 1. dfs.namenode.lifeline.handler.count Sets number of RPC server threads the NameNode runs for handling the lifeline RPC server. The default value is 1, because this RPC server handles only HA health check requests from ZKFC. These are lightweight
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001314#comment-15001314 ] Daryn Sharp commented on HDFS-9239: ---
There are really 2 problems to solve here: ensuring the DN can actually heartbeat, as Kihwal alluded to, and ensuring the NN can process it in a reasonable time.
In the DN, our main problems with the DN jamming up and not sending heartbeats were: 1) commands (finalize) not handled async. 2) getting the du/df metrics for the heartbeat blocked because the block layout change paralyzed disks. Although finalize is now async, in the more general sense heartbeat response commands should always be decoupled from the sending of the heartbeat.
On the NN, the fsn lock could be a problem but in practice, we've not had it even with over 5k nodes. But I really like the approach of making the heartbeat stat updates fsn-lockless. Collecting the commands w/o the lock (since it doubles as an operational state lock) isn't trivial or you would have done that. What you could do after the stat update is use tryLock for a short time. If you can't get the lock, oh well, this heartbeat response doesn't get any commands.
I'm not sure we need yet another RPC server for this purpose. NN heartbeat processing with a lockless + tryLock implementation would make it ideally suited for the existing client and/or service servers.
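The lockless stat update followed by a short tryLock, as suggested in the comment above, might look roughly like the sketch below. It is an illustration under assumptions, not NameNode code: {{HeartbeatTryLockSketch}}, its maps, and the string commands are hypothetical, and a plain {{ReentrantLock}} stands in for the fsn lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class HeartbeatTryLockSketch {
  // Hypothetical stand-in for the namesystem (fsn) lock.
  final ReentrantLock namesystemLock = new ReentrantLock();

  // Liveness stats are updated without taking the namesystem lock.
  private final Map<String, Long> lastHeartbeatMs = new ConcurrentHashMap<>();

  // Pending commands per node; only touched while holding namesystemLock.
  private final Map<String, List<String>> pendingCommands = new ConcurrentHashMap<>();

  void queueCommand(String nodeId, String command) {
    namesystemLock.lock();
    try {
      pendingCommands.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(command);
    } finally {
      namesystemLock.unlock();
    }
  }

  /**
   * Records liveness first, then waits at most lockWaitMs for the lock.
   * If the lock is busy, this heartbeat response carries no commands.
   */
  List<String> handleHeartbeat(String nodeId, long nowMs, long lockWaitMs) {
    lastHeartbeatMs.put(nodeId, nowMs); // lock-free liveness update
    boolean locked = false;
    try {
      locked = namesystemLock.tryLock(lockWaitMs, TimeUnit.MILLISECONDS);
      if (!locked) {
        return Collections.emptyList(); // commands stay queued for later
      }
      List<String> commands = pendingCommands.remove(nodeId);
      return commands == null ? Collections.<String>emptyList() : commands;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Collections.emptyList();
    } finally {
      if (locked) {
        namesystemLock.unlock();
      }
    }
  }

  Long lastHeartbeat(String nodeId) {
    return lastHeartbeatMs.get(nodeId);
  }
}
```

Skipped commands remain queued in this sketch and go out with the next heartbeat that wins the lock. It places no bound on how many heartbeats in a row may be skipped.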
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999709#comment-14999709 ] Chris Nauroth commented on HDFS-9239: - Some pre-requisite code for this was committed in scope of HDFS-9311, so I'm linking the issues.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1452#comment-1452 ] Hadoop QA commented on HDFS-9239: -
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 8s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 27s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 7s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 3s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 44s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 27s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 5m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 5s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 5m 5s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 5s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 9s {color} | {color:red} Patch generated 10 new checkstyle issues in root (total was 1105, now 1106). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 55s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 42s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 42s {color} | {color:red} hadoop-common in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 12s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 40s {color} | {color:red} hadoop-common in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 38s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 25s {color} | {color:red} Patch generated 56 ASF License warnings. {color} |
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960776#comment-14960776 ] Kihwal Lee commented on HDFS-9239: -- It may not help much on the namenode side. Even on extremely busy clusters, I have not seen nodes missing heartbeats and being considered dead because of contention among heartbeats, incremental block reports (IBR) and full block reports (FBR). Well before node liveness is affected by inundation of IBRs and FBRs, the namenode performance will degrade to an unacceptable level. It is really easy to test this. Create a wide job that creates a lot of small files. However, making it lighter on the datanode side is a good idea. We have seen many cases where nodes are declared dead because the service actor thread is delayed/blocked.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961540#comment-14961540 ] Jitendra Nath Pandey commented on HDFS-9239: bq. .. Well before node liveness is affected by inundation of IBRs and FBRs, the namenode performance will degrade to unacceptable level... Yes, indeed. But if datanodes are marked as dead in that situation, that completely destabilizes the system. At that point, even if we kill certain offending jobs, it takes a while before NN can come back to an acceptable service level. This proposal should help prevent the death after NN is past the overloading scenario. I think ZKFC healthcheck should also be separated into a different queue or port so that they are not blocked by other messages in NN's call queue. A failover because NN is busy is not very helpful. The other NN also gets busy and we end up seeing active-standby flip-flop between the namenodes.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959826#comment-14959826 ] Daryn Sharp commented on HDFS-9239: --- It seems like a good idea at first, but I don't think the proposal solves the stated issues: * This prevents the NameNode from spuriously marking healthy DataNodes as stale or dead. * ... delayed DataNodes may be flagged as stale, and applications may erroneously choose to avoid accessing those nodes * ... DataNodes may be flagged as dead. In extreme cases, this can cause a NameNode to schedule wasteful rereplication activity. Let's say the NN can't service heartbeats to avoid false-staleness (stale defaults to 30s). That means it definitely can't process IBRs either. Would a lifeline to prevent the stale flag matter at this point? At this level of congestion, nearly all of the nodes are going stale. The staleness is probably the least of your worries. If nodes are marked dead from inability to keep up with heartbeats (defaults to ~10min), the cluster itself is already. Worrying about wasted replications is dubious because the NN can't issue replications if it can't process the heartbeats. That is not a heavy load scenario. From personal experience, it sounds like the fallout of a 120GB+ heap stop-the-world GC. The NN wakes up, heartbeat monitor starts marking everything dead. This sparks a replication storm, followed by invalidation storm, which the NN recovers from... unless it goes into another full GC. The lifeline might help slow the rise of false-dead nodes. However, I recently patched the heartbeat monitor to detect long GCs and be very gracious before marking nodes dead. If I've misinterpreted anything, please describe the incident that prompted this approach so we can see if it would have helped. 
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957150#comment-14957150 ] Chris Nauroth commented on HDFS-9239: - I briefly considered UDP for that very reason. However, there isn't a strong precedent for UDP in Hadoop right now, aside from a few optional components like the HDFS NFS gateway and the Ganglia and StatsD metrics sinks. I'm reluctant to add UDP delivery troubleshooting to the mix of operational concerns that administrators need to know to support core functionality.
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956523#comment-14956523 ] Steve Loughran commented on HDFS-9239: -- Does this actually need RPC? Or could some UDP packet be submitted, with the recipient doing an initial auth check of the sender and then queuing it to update the internal state? After all, these are intended to be one-way announcements.
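The UDP idea floated here could look roughly like the sketch below: a fire-and-forget datagram from the DataNode, and a receiver that checks the sender and then queues the update for later processing. This is only an illustration under stated assumptions. Every name in it (`UdpLifelineSketch`, `sendLifeline`, `accept`, `receiveOnce`) is hypothetical and does not exist in Hadoop, and the address allow-list is a stand-in for a real authentication step, since UDP source addresses can be spoofed.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Queue;
import java.util.Set;

/** Hypothetical one-way UDP lifeline; none of these names exist in Hadoop. */
class UdpLifelineSketch {

  /** Sender side: send one datagram carrying the DataNode id, expect no reply. */
  static void sendLifeline(String datanodeId, InetAddress host, int port)
      throws IOException {
    byte[] payload = datanodeId.getBytes(StandardCharsets.UTF_8);
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.send(new DatagramPacket(payload, payload.length, host, port));
    }
  }

  /**
   * Placeholder auth check plus hand-off to a queue.
   * Returns true if the sender is known and the update was queued.
   * An allow-list of addresses is an assumption made for brevity only.
   */
  static boolean accept(InetAddress sender, byte[] data, int length,
                        Set<InetAddress> knownDataNodes,
                        Queue<String> pendingUpdates) {
    if (!knownDataNodes.contains(sender)) {
      return false; // unknown sender: drop the packet
    }
    String datanodeId = new String(data, 0, length, StandardCharsets.UTF_8);
    return pendingUpdates.offer(datanodeId);
  }

  /** Receiver side: block for one datagram, then run it through accept(). */
  static boolean receiveOnce(DatagramSocket socket,
                             Set<InetAddress> knownDataNodes,
                             Queue<String> pendingUpdates) throws IOException {
    byte[] buffer = new byte[512];
    DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
    socket.receive(packet);
    return accept(packet.getAddress(), packet.getData(), packet.getLength(),
                  knownDataNodes, pendingUpdates);
  }
}
```

Keeping the check-and-queue step in `accept` separate from the socket read means the receive path does no state updates itself; a different thread could drain `pendingUpdates` and refresh liveness timestamps. Lost datagrams are simply not noticed by the sender, which is the delivery-troubleshooting concern raised elsewhere in this thread.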