[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2019-05-20 Thread Feng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844457#comment-16844457
 ] 

Feng Yuan commented on HDFS-9239:
-

Hi [~cnauroth], there is still block issue.

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.LifelineSender#sendLifeline
bpos.getBlockPoolId readLock,but processcommand hold the writeLock

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218470#comment-15218470
 ] 

Nathan Roberts commented on HDFS-9239:
--

bq. Just to make sure I'm clear, are you talking about configuring the deadline 
scheduler as described here?

Yes, those links are talking about the right parameters. 

We currently run with read_expire=1000, write_expire=1000, and 
writes_starved=1. Since our I/O workloads change dramatically over time, we 
didn't spend a lot of time looking for optimal values here. These have been 
working well for the last several months across multiple clusters.

As an aside, a relatively easy way to reproduce this problem, is to put a heavy 
seek load on all the disks of a datanode (e.g. 
http://www.linuxinsight.com/how_fast_is_your_disk.html, I believe 5-10 copies 
of seeker were sufficient.) After a minute or so, system becomes almost 
unusable and datanode will be declared lost. This might be a good test to run 
against the lifeline protocol. My hunch is, with CFQ, the datanode will still 
be lost. 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218396#comment-15218396
 ] 

Chris Nauroth commented on HDFS-9239:
-

[~nroberts], fascinating!  Thank you for sharing.  Just to make sure I'm clear, 
are you talking about configuring the deadline scheduler as described here?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Kernel_Boot_Parameters-The_IO_Scheduler.html

Also, did you find any relevant additional scheduler tuning configuration was 
needed, as described here?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/ch06s04s02.html


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218302#comment-15218302
 ] 

Nathan Roberts commented on HDFS-9239:
--

bq. However,making it lighter on the datanode side is a good idea. We have seen 
many cases where nodes are declared dead because the service actor thread is 
delayed/blocked. 

Just a quick update on this comment. Even after HDFS-7060 we still had cases 
where Datanodes would fail to heartbeat in. We eventually tracked this down to 
the RHEL CFQ I/O scheduler. There are situations where significant seek 
activity (like a massive shuffle) can cause this I/O scheduler to indefinitely 
starve writers. This eventually causes the datanode and/or nodemanager 
processes to completely stop (probably due to logging I/O backing up). So, no 
matter how smart we make these daemons, they are going to be lost from the 
NN/RM point of view in these situations. But, this is actually probably the 
right thing to do in these cases, these daemons are clearly not able to do 
their job so SHOULD be declared lost. 

In any event, the change which we found most valuable for this situation was to 
use the deadline I/O scheduler. This dramatically improved the number of lost 
datanodes and nodemanagers we were seeing.
 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218270#comment-15218270
 ] 

Chris Nauroth commented on HDFS-9239:
-

I also am hesitant to backport a patch of this size right now, but I might 
reconsider after getting more production experience with the feature.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217719#comment-15217719
 ] 

Steve Loughran commented on HDFS-9239:
--

I'm nervous about pushing stuff back. At the very least, ship Hadoop 2.8 and 
see if things break

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217710#comment-15217710
 ] 

Vinayakumar B commented on HDFS-9239:
-

Even though this Jira along with HDFS-9311 is New Feature, IMO it would be 
worth to merge to Branch-2.7. Lot of deployments would be benefited.
What you think [~kihwal]?
If okay, I would be happy to post a patch for branch-2.7, if available patches 
not apply on branch-2.7.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181353#comment-15181353
 ] 

Hudson commented on HDFS-9239:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9426 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9426/])
HDFS-9239. DataNode Lifeline Protocol: an alternative protocol for (cnauroth: 
rev 2759689d7d23001f007cb0dbe2521de90734dd5c)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolPB.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeRegister.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-common-project/hadoop-common/src/site/markdown/Metrics.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeLifelineProtocol.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/pom.xml
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockPoolManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeLifeline.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/NamenodeProtocols.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolServerSideTranslatorPB.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeLifelineProtocol.proto
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolClientSideTranslatorPB.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180528#comment-15180528
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9239:
---

+1 the new patch looks good.  Thanks for the update.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180374#comment-15180374
 ] 

Chris Nauroth commented on HDFS-9239:
-

The test failures are unrelated.  The remaining style warnings are not worth 
addressing.  [~szetszwo], would you please take a look at patch v003 and my 
comments that go with it?  Thank you.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180227#comment-15180227
 ] 

Kihwal Lee commented on HDFS-9239:
--

Filed HDFS-9905 for TestWebHdfsTimeouts.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179493#comment-15179493
 ] 

Hadoop QA commented on HDFS-9239:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 31s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
58s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 
32s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 44s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
38s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 0s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
59s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 
50s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red} 15m 10s {color} | 
{color:red} root-jdk1.8.0_74 with JDK v1.8.0_74 generated 1 new + 9 unchanged - 
1 fixed = 10 total (was 10) {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 12m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 12m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
11s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 11s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 11s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 46s 
{color} | {color:red} root: patch generated 7 new + 1053 unchanged - 6 fixed = 
1060 total (was 1059) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 1s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 11m 14s {color} 
| {color:red} hadoop-common in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 95m 57s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_74. {color} |
| 

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177102#comment-15177102
 ] 

Hadoop QA commented on HDFS-9239:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
49s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 36s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 1s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
31s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 59s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 53s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 7s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 7s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 7s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 19s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 19s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 19s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 19s 
{color} | {color:red} root: patch generated 8 new + 1053 unchanged - 6 fixed = 
1061 total (was 1059) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 56s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
27s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
57s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 58s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 45s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.8.0_72. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 13s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_72. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 3s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.7.0_95. 
{color} |
| 

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176961#comment-15176961
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9239:
---

Should the try-catch be restructured like below?
{code}
@Override
public void run() {
  try {
initialRegistrationComplete.await();

while (shouldRun()) {
  try {
if (lifelineNamenode == null) {
  lifelineNamenode = dn.connectToLifelineNN(lifelineNnAddr);
}
sendLifelineIfDue();
  } catch (IOException e) {
LOG.warn("IOException in LifelineSender for " + 
BPServiceActor.this, e);
  }
  Thread.sleep(scheduler.getLifelineWaitTime());
}
  } catch (InterruptedException e) {
LOG.warn("LifelineSender interrupted", e);
  }

  LOG.info("LifelineSender for " + BPServiceActor.this + " exiting.");
}
{code}

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176944#comment-15176944
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9239:
---

{code}
//LifelineSender
+public void run() {
+  while (shouldRun()) {
+try {
+  initialRegistrationComplete.await();
+  break;
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+}
+  }
{code}
If there is an InterruptedException, it probably should rethrow it as an 
RuntimeException.  Calling Thread.currentThread().interrupt() just set thread's 
interrupt status but the thread will continue running.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176926#comment-15176926
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9239:
---

This feature is going to be very useful for busy clusters.

Some quick comments:
{code}
// DatanodeManager.handleLifeline
synchronized (heartbeatManager) {
  synchronized (datanodeMap) {
DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
...
heartbeatManager.updateLifeline(nodeinfo, reports, cacheCapacity,
cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary);
  }
{code}
- synchronized (datanodeMap) shoud be synchronized (this).  We no longer 
synchronize on datanodeMap.
- Do we need synchronized (heartbeatManager)?  heartbeatManager.updateLifeline 
is already synchronized.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118590#comment-15118590
 ] 

Hadoop QA commented on HDFS-9239:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} 
| {color:red} HDFS-9239 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12771665/HDFS-9239.001.patch |
| JIRA Issue | HDFS-9239 |
| Powered by | Apache Yetus 0.2.0-SNAPSHOT   http://yetus.apache.org |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/14257/console |


This message was automatically generated.



> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-17 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009762#comment-15009762
 ] 

Chris Nauroth commented on HDFS-9239:
-

Thanks for the great reviews, everyone.  I'll respond to some of the feedback 
now and update the patch later.

bq. What you could do after the stat update is use tryLock for a short time. If 
you can't get the lock, oh well, this heartbeat response doesn't get any 
commands.

That's an interesting idea.  I'll explore this.

bq. I'm not sure we need yet another RPC server for this purpose.

Just to make sure everyone is aware, the additional optional RPC server is 
effectively already there due to committing HDFS-9311, which specifically 
targeted the related problem of ZKFC health check messages getting blocked.  I 
agree that yet another RPC server is not ideal in terms of operational 
complexity, but I also saw it as the only viable option short-term achievable 
in the 2.x line.

bq. In {{BPOfferServiceActor#run}} we retry await operation on being 
interrupted. My question is when would it be safe to retry ?

In practice, I expect it never will retry.  Suppose the thread enters {{await}} 
on the latch, but gets interrupted before the initial registration completes.  
The only thing that triggers thread interruption is shutdown of the whole 
DataNode (either the whole JVM process or a single DataNode inside a 
{{MiniDFSCluster}}).  That means that by the time this thread gets interrupted, 
internal flags have already been updated such that the {{shouldRun()}} call on 
the next iteration will return {{false}}.  {{run()}} would then return with no 
further action taken, and the thread would stop.

However, there is also no harm done if the {{await}} gets retried.  This is a 
daemon thread, so even if it keeps retrying, it won't block a normal JVM exit.

bq. Just wondering if it makes sense to move synchronized(datanodeMap) into 
getDataNode.

This might be a good idea, but I'd prefer to treat it as a separate code 
improvement decoupled from the work here.  Right now, there are multiple points 
in the code that depend on specific lock ordering of {{heartbeatManager}} first 
followed by {{datanodeMap}} second to prevent deadlock.  The current code makes 
this explicit with nested {{synchronized (X)}} blocks instead of implicit by 
declaring particular methods {{synchronized}}.  Also, HDFS-8966 is making a lot 
of the changes in the locking here, so I expect making changes now would just 
create merge conflicts for that effort later.

bq. Did you intend to call {{heartbeatManager.updateLifeline}} inside the 
{{synchronized(datanodeMap)}} or just inside {{synchronized 
(heartbeatManager)}}. Do we need to keep the lock on datanodeMap while updating 
stats ?

This locking was intentional.  If we do not hold the lock on {{datanodeMap}} 
during the get+update, then there is a risk that multiple heartbeats in flight 
for the same DataNode could cause a lost update, or even an inconsistency where 
the final state of the {{DatanodeDescriptor}} really contains some stats from 
one heartbeat and other stats from another heartbeat.  I have not observed 
excessive lock contention here during the issues that prompted me to file this 
JIRA, so I didn't try hard to optimize this locking away.  Some of the work in 
HDFS-8966 is likely to reduce the locking here anyway.

bq. NN should still enforce a max number of skips and guarantee commands are 
sent in bounded time. Replication or block recovery is done through an 
asynchronous protocol, but oftentimes clients expect them to be done "soon".

Are you saying that beyond some skipping threshold, the heartbeat should still 
be considered a failure, eventually causing the DataNode to be marked stale and 
then dead?  I'm not sure how we'd set such a threshold, given that there is no 
SLA defined on these operations AFAIK.  I'd say there is still value in keeping 
a DataNode alive in these cases, such as for serving reader activity.

bq. It seems the introduction of a new RPC server is to work around the 
existing functionality of RPC which only support QoS based on user names.

Yes, that's partially correct.

I agree that introduction of another RPC server is in some sense a workaround.  
In fact, I would make the same argument for {{dfs.namenode.servicerpc-address}} 
in the first place too.  Lack of sophisticated QoS drove us to isolate 
operations to a separate RPC server.  ({{dfs.namenode.servicerpc-address}} also 
can have some side benefits in multi-homed environments that want to dedicate a 
whole separate network interface to certain operations.)

I think a separate RPC server is a viable short-to-mid-term solution in the 2.x 
line.  Longer term, I'd prefer that we evolve towards more sophisticated QoS 
that prioritizes critical "control plane" activity like heartbeats.  However, 
this goes deeper than just QoS at the RPC layer, because we 

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-12 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003219#comment-15003219
 ] 

Ming Ma commented on HDFS-9239:
---

Sorry for the jumping in late for the discussion. While we haven't seen any 
recent issues caused by DNs incorrectly marked as dead, maybe this feature 
could mitigate replication storm issue where incorrectly marked DNs will cause 
even more replication?

* It seems the introduction of a new RPC server is to work around the existing 
functionality of RPC which only support QoS based on user names. Image if RPC 
server can provide differentiated service based on method names, then we can 
just add {{sendLifeline}} to existing {{DatanodeProtocol}} and have the same 
RPC server can process the method call at the highest priority. Adding 
method-based RPC QoS could have help other use cases, for example, if we want 
to prioritize existing heartbeat over IBR.
* Regarding the DN contention scenario which blocks it from sending 
{{sendLifeline}} to NN, we could skip all info such as storage reports. But if 
DN is already such state, maybe not sending {{sendLifeline}} is what we want 
anyway.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-12 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002293#comment-15002293
 ] 

Kihwal Lee commented on HDFS-9239:
--

bq. NN heartbeat processing with a lockless + tryLock implementation would make 
it ideally suited for the existing client and/or service servers.
NN should still enforce a max number of skips and guarantee commands are sent 
in bounded time. Replication or block recovery is done through an asynchronous 
protocol, but oftentimes clients expect them to be done "soon". 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-11 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001367#comment-15001367
 ] 

Anu Engineer commented on HDFS-9239:


[~cnauroth] Without alluding to anything that [~daryn] brought up, Thanks for 
the patch. It looks very good, some minor comments / questions below. 

1.  nit : {{BPofferService#Constructor}} 
Perhaps add a Precondition to make this relationship explicit.  something like 
Preconditions.checkState(lifelineNnAddrs.size() == nnAddrs.size()), since we 
access both lists with the same index ? 

2. More of a question for my own understanding :
In  {{BPOfferServiceActor#run}} we retry await operation on being interrupted. 
My question is when would it be safe to retry ? Wanted to understand if there 
were any scenarios where this can happen. Btw, I do see this as a good coding 
pattern.

{code}
while (shouldRun()) {
try {
  initialRegistrationComplete.await();
  break;
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
}
  }
{code}


3. nit: DataNodeManager.java - Javadoc - xmitsInProgress replaced by 
maxTransfers

4. In DataNodeManager.java : 
{code}
  synchronized (datanodeMap) {
DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
{code}
Two comments here :
  a. Just wondering if it makes sense to move synchronized(datanodeMap) into 
getDataNode.
 b. Did you intend to call {{heartbeatManager.updateLifeline}} inside the 
{{synchronized(datanodeMap)}} or just inside {{synchronized 
(heartbeatManager)}}. Do we need to keep the lock on datanodeMap while updating 
stats ? 

5. hdfs-default.xml : nit  : Comment :  since we rely on ratio as the default 
you might want to fix the comment which says default is 1.
  
  dfs.namenode.lifeline.handler.count
  
  
Sets number of RPC server threads the NameNode runs for handling the
lifeline RPC server.  The default value is 1, because this RPC server
handles only HA health check requests from ZKFC.  These are lightweight


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-11 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001314#comment-15001314
 ] 

Daryn Sharp commented on HDFS-9239:
---

There are really 2 problems to solve here.  Ensuring the DN can actually 
heartbeat as Kihwal alluded to.  Ensuring the NN can process it in a reasonable 
time.

In the DN, our main problems with the DN jamming up and not sending heartbeats 
were: 1) commands (finalize) not handled async.  2) getting the du/df metrics 
for the heartbeat blocked because the block layout change paralyzed disks.  
Although finalize is now async, in the more general sense heartbeats response 
commands should always be decoupled from the sending of the heartbeat.

On the NN, the fsn lock could be a problem but in practice, we've not had it 
even with over 5k nodes.  But I really like the approach of making the 
heartbeat stat updates fsn-lockless.  Collecting the commands w/o the lock 
(since it doubles as a operational state lock) isn't trivial or you would have 
done that.  What you could do after the stat update is use tryLock for a short 
time.  If you can't get the lock, oh well, this heartbeat response doesn't get 
any commands.

I'm not sure we need yet another RPC server for this purpose.  NN heartbeat 
processing with a lockless + tryLock implementation would make it ideally 
suited for the existing client and/or service servers.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999709#comment-14999709
 ] 

Chris Nauroth commented on HDFS-9239:
-

Some pre-requisite code for this was committed in scope of HDFS-9311, so I'm 
linking the issues.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1452#comment-1452
 ] 

Hadoop QA commented on HDFS-9239:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 8s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 7s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 3s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 44s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 5m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 5s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 5m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 5s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 9s 
{color} | {color:red} Patch generated 10 new checkstyle issues in root (total 
was 1105, now 1106). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 55s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 42s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 42s {color} 
| {color:red} hadoop-common in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 12s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 40s {color} 
| {color:red} hadoop-common in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 38s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 25s 
{color} | {color:red} Patch generated 56 ASF License warnings. {color} |
| 

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-10-16 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960776#comment-14960776
 ] 

Kihwal Lee commented on HDFS-9239:
--

It may not help much with the namenode side. Even on extremely busy clusters, I 
have not seen nodes missing heartbeat and considered dead because of the 
contention among heartbeats, incremental block reports (IBR) and full block 
reports (FBR).  Well before node liveness is affected by inundation of IBRs and 
FBRs, the namenode performance will degrade to unacceptable level. It is really 
easy to test this. Create a wide job that creates a lot small files. 

However,making it lighter on the datanode side is a good idea. We have seen 
many cases where nodes are declared dead because the service actor thread is 
delayed/blocked. 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-10-16 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961540#comment-14961540
 ] 

Jitendra Nath Pandey commented on HDFS-9239:


bq. .. Well before node liveness is affected by inundation of IBRs and FBRs, 
the namenode performance will degrade to unacceptable level...

  Yes, indeed. But if datanodes are marked as dead in that situation, that 
completely destabilizes the system. At that point, even if we kill certain 
offending jobs, it takes a while before NN can come back to an acceptable 
service level. This proposal should help prevent the death after NN is past the 
overloading scenario.

  I think ZKFC healthcheck should also be separated into a different queue or 
port so that they are not blocked by other messages in NN's call queue. A 
failover because NN is busy is not very helpful. The other NN also gets busy 
and we end up seeing active-standby flip-flop between the namenodes.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-10-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959826#comment-14959826
 ] 

Daryn Sharp commented on HDFS-9239:
---

It seems like a good idea at first, but I don't think the proposal solves the 
stated issues:
* This  prevents the NameNode from spuriously marking healthy 
DataNodes as stale or dead.
* ... delayed DataNodes may be flagged as stale, and applications may 
erroneously choose to avoid accessing those nodes
* ... DataNodes may be flagged as dead. In extreme cases, this can cause a 
NameNode to schedule wasteful re­replication activity.

Let's say the NN can't service heartbeats to avoid false-staleness (stale 
defaults to 30s). That means it definitely can't process IBRs either. Would a 
lifeline to prevent the stale flag matter at this point?  At this level of 
congestion, nearly all of the nodes are going stale.  The staleness is probably 
the least of your worries.  If nodes are marked dead from inability to keep up 
with heartbeats (defaults to ~10min), the cluster itself is already.  Worrying 
about wasted replications is dubious because the NN can't issue replications if 
it can't process the heartbeats.

That is not a heavy load scenario.  From personal experience, it sounds like 
the fallout of a 120GB+ heap stop-the-world GC.  The NN wakes up, heartbeat 
monitor starts marking everything dead.  This sparks a replication storm, 
followed by invalidation storm, which the NN recovers from... unless it goes 
into another full GC.  The lifeline might help slow the rise of false-dead 
nodes.  However, I recently patched the heartbeat monitor to detect long GCs 
and be very gracious before marking nodes dead.

If I've misinterpreted anything, please describe the incident that prompted 
this approach so we can see if it would have helped.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-10-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957150#comment-14957150
 ] 

Chris Nauroth commented on HDFS-9239:
-

I briefly considered UDP for that very reason.  However, there isn't a strong 
precedent for UDP in Hadoop right now, aside from a few optional components 
like the HDFS NFS gateway and the Ganglia and StatsD metrics sinks.  I'm 
reluctant to add UDP delivery troubleshooting to the mix of operational 
concerns that administrators need to know to support core functionality.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2015-10-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956523#comment-14956523
 ] 

Steve Loughran commented on HDFS-9239:
--

Does this actually need RPC? Or could some UDP packet be submitted, with the 
recipient doing an initial auth check of the sender, then queuing it to update 
the internal state. After all, this is intended to be one-way announcements

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)