[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493904#comment-13493904
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Yarn-trunk #31 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/31/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java


 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.5

 Attachments: HDFS-3990.branch-0.23.patch, 
 HDFS-3990.branch-0.23.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt


 The dfshealth page will place a read lock on the namespace while it does a 
 dns lookup for every DN.  On a multi-thousand node cluster, this often 
 results in 10s+ load time for the health page.  10 concurrent requests were 
 found to cause 7m+ load times during which time write operations blocked.
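
 The locking pattern described above can be sketched roughly as follows. This 
 is an illustration only, not the NameNode code: HealthReportSketch, its 
 fields, and the resolver function are hypothetical stand-ins for the namespace 
 lock, the datanode list, and a DNS lookup. It contrasts resolving each name 
 while the read lock is held with reading a name that was cached once at 
 registration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical stand-in for the namespace lock and datanode list; not Hadoop code.
final class HealthReportSketch {
    private final ReentrantReadWriteLock namespaceLock = new ReentrantReadWriteLock();
    private final List<String> datanodeIps = new ArrayList<>();
    private final List<String> cachedHostNames = new ArrayList<>();

    // Registration resolves the name once, before taking the lock.
    void register(String ip, Function<String, String> resolver) {
        String host = resolver.apply(ip);
        namespaceLock.writeLock().lock();
        try {
            datanodeIps.add(ip);
            cachedHostNames.add(host);
        } finally {
            namespaceLock.writeLock().unlock();
        }
    }

    // Problematic pattern: one (possibly slow) lookup per datanode while the
    // read lock is held, so writers wait for the whole loop to finish.
    List<String> reportWithLookups(Function<String, String> resolver) {
        namespaceLock.readLock().lock();
        try {
            List<String> names = new ArrayList<>();
            for (String ip : datanodeIps) {
                names.add(resolver.apply(ip));
            }
            return names;
        } finally {
            namespaceLock.readLock().unlock();
        }
    }

    // Cheaper pattern: the lock is held only long enough to copy cached names.
    List<String> reportFromCache() {
        namespaceLock.readLock().lock();
        try {
            return new ArrayList<>(cachedHostNames);
        } finally {
            namespaceLock.readLock().unlock();
        }
    }
}
```

 With a resolver that takes a few milliseconds per call, the first report 
 method holds the read lock for (number of datanodes) x (lookup time), which is 
 the shape of the 10s+ load time reported above.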

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493973#comment-13493973
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Hdfs-0.23-Build #430 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/430/])
HDFS-3990. NN's health report has severe performance problems (daryn) 
(Revision 1407336)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407336
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493985#comment-13493985
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Hdfs-trunk #1221 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1221/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494011#comment-13494011
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Mapreduce-trunk #1251 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1251/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = FAILURE
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-08 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493651#comment-13493651
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-trunk-Commit #2985 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2985/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-30 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487135#comment-13487135
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12551382/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3428//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3428//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-30 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487247#comment-13487247
 ] 

Eli Collins commented on HDFS-3990:
---

I missed that you switched to a List because we're conditionally adding items, 
which makes an ImmutableList hard to use. I think using a List is better than 
the latest patch, where you convert the List to an array, so +1 to the Oct 22nd 
patch.
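
For readers following the List discussion, here is a minimal sketch of building 
the name list with conditional adds. The signature is hypothetical (the real 
method lives in DatanodeManager and may take different arguments), and wrapping 
the result with Collections.unmodifiableList is an assumption made here to show 
one way of keeping callers from mutating it; it is not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; names and signature are illustrative only.
final class HostFilterNames {
    static List<String> getNodeNamesForHostFiltering(
            String ipAddr, String hostName, String peerHostName) {
        List<String> names = new ArrayList<>(3);
        names.add(ipAddr);
        names.add(hostName);
        // Conditional add: the list length is not known up front, which is
        // awkward with a fixed-size array or a single-expression immutable list.
        if (peerHostName != null) {
            names.add(peerHostName);
        }
        // Assumption: hand out a read-only view of the list.
        return Collections.unmodifiableList(names);
    }
}
```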


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-29 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486279#comment-13486279
 ] 

Eli Collins commented on HDFS-3990:
---

Now that you're switching getNodeNamesForHostFiltering from using an array to a 
List, I'd use an ImmutableList.  +1 otherwise


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-22 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481804#comment-13481804
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12550340/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3378//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3378//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-18 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479000#comment-13479000
 ] 

Daryn Sharp commented on HDFS-3990:
---

bq. Let's remove this line and leave peerHostName as null since we're claiming 
the peerHostname is the hostname from the actual connection. It's also useful 
to have something to check to indicate the peerHostName has not been determined.

The known case where the {{peerHostName}} will not be set is when the 
minicluster tests directly register a dn.  If the assignment is removed, then 
I'm not sure where the null check should be and what it should do.  It could 
be in {{DatanodeID#getPeerHostName}}, which would return the {{hostName}} field 
instead.  Or the getter could return null, and 
{{DatanodeManager#getNodeNamesForHostFiltering}} would not return the 
{{peerHostName}} if it is null.  I'm a bit concerned that tests, such as 
include/exclude list checks, might again break...  Or I could update the 
comment to indicate it's either the remote RPC host or the dn reg's hostname?
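
The two placements for the null check discussed above might look like this. It 
is only a sketch: DatanodeIdSketch and the method name 
getPeerHostNameOrReported are hypothetical, and the real {{DatanodeID}} has 
more fields and different constructors.

```java
// Hypothetical stand-in for the two hostname fields of DatanodeID.
final class DatanodeIdSketch {
    private final String hostName;   // name reported by the datanode itself
    private String peerHostName;     // name taken from the RPC connection; may be unset

    DatanodeIdSketch(String hostName) {
        this.hostName = hostName;
    }

    void setPeerHostName(String peerHostName) {
        this.peerHostName = peerHostName;
    }

    // Option 1: hide the null inside the getter by falling back to the
    // reported hostname.
    String getPeerHostNameOrReported() {
        return peerHostName != null ? peerHostName : hostName;
    }

    // Option 2: return null so callers can tell the peer name was never
    // determined and skip it.
    String getPeerHostName() {
        return peerHostName;
    }
}
```

Option 2 keeps "not yet determined" observable, which is the property asked for 
in the quoted review comment; option 1 keeps existing callers free of null 
checks.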


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-18 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479207#comment-13479207
 ] 

Eli Collins commented on HDFS-3990:
---

A null peerHostname just means you don't match. Since we also check hostName, 
which is reported by the DataNode and which the mini cluster explicitly sets, 
we should be good; that's the current behavior after all, right? I.e. the only 
thing we're adding here is an additional hostname field to check, which is null 
in the tests and so won't be checked there. Relatedly, it would be good to make 
the minicluster match real cluster behavior here.
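
A small sketch of the matching behavior described above, in which a null 
candidate simply never matches. IncludeListCheck and matchesAny are 
hypothetical names, and the real include/exclude handling in the NameNode is 
more involved than a plain set lookup.

```java
import java.util.Set;

// Hypothetical helper; illustrates only the "null never matches" rule.
final class IncludeListCheck {
    static boolean matchesAny(Set<String> entries, String... candidates) {
        for (String candidate : candidates) {
            // Skip unset names explicitly rather than relying on how a
            // particular Set implementation treats contains(null).
            if (candidate != null && entries.contains(candidate)) {
                return true;
            }
        }
        return false;
    }
}
```

Calling matchesAny(includes, ipAddr, hostName, peerHostName) with a null 
peerHostName then behaves exactly like checking ipAddr and hostName alone.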


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477909#comment-13477909
 ] 

Daryn Sharp commented on HDFS-3990:
---

In your patch, it's not necessary for the NN to do another lookup of the DN's 
hostname.  It's already available in the {{InetAddress}} returned by 
{{Server.getRemoteIp()}}.  Passing this {{InetAddress}} to {{updateNodeAddr}}, 
rather than updating the hostname and ip individually, ensures the host and ip 
are always updated in tandem, which avoids your concern about the fields going 
out of sync.
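
As a rough illustration of updating both fields from one {{InetAddress}}: the 
class below is a hypothetical stand-in for the address fields of 
{{DatanodeID}}, and the literal helper exists only so the example can build an 
address without touching DNS. Only java.net.InetAddress calls from the JDK are 
used.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-in for the address fields of DatanodeID; not Hadoop code.
final class NodeAddrSketch {
    private String ipAddr;
    private String peerHostName;

    // Both fields are derived from the same InetAddress, so they cannot
    // drift apart. Note that getHostName() may perform a reverse lookup if
    // the InetAddress was created without a host name attached.
    void updateNodeAddr(InetAddress addr) {
        this.ipAddr = addr.getHostAddress();
        this.peerHostName = addr.getHostName();
    }

    String getIpAddr() { return ipAddr; }

    String getPeerHostName() { return peerHostName; }

    // Helper for the example: builds an InetAddress from a name and raw IPv4
    // bytes; getByAddress does not consult a name service.
    static InetAddress literal(String host, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```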

If we do change the datanode manager to ignore the hostname in the node 
registration, do you think it's possible to update all the tests that check 
rack placement?  I'm not sure how we can do that in a timely manner, so would 
you be willing to have a separate jira for that functional change to expedite 
this compatible one?


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477975#comment-13477975
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549515/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.balancer.TestBalancer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3355//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3355//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478000#comment-13478000
 ] 

Daryn Sharp commented on HDFS-3990:
---

The balancer test seems to randomly fail.  It passes for me.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478271#comment-13478271
 ] 

Eli Collins commented on HDFS-3990:
---

I think the approach in the latest patch should work. Once HDFS-4068 is in, you 
can rebase on it and remove all the cleanup.

Comments:
- We can remove the dnAddress check for null now that it looks like 
NNThroughputBenchmark always uses RPC.
- Rename getNodeNames to something more explicit like 
getNodeNamesForHostFiltering?
- Rather than have updateNodeAddr, let's use the two setters explicitly; it's 
easier to follow the registration behavior (ie we explicitly clobber the ip and 
peer hostname). Hopefully we'll eventually be able to make DatanodeID immutable 
so we don't update it in place.
- Let's update getNodeNames to include the DN hostname since that is the 
current behavior, and file a separate jira for removing the use of the DN 
reported hostname here (or perhaps removing the reported DN field entirely).
- Let's update hashCode in a separate change. I think this will need some 
additional changes like modifying Host2NodesMap to use DatanodeID hashCode; it 
currently explicitly uses the IP addr for the hash and ignores 
DatanodeID#hashCode.
- Add a javadoc to testDNSLookups indicating that we're testing that the NN 
does *not* do DN lookups after registration.
- Nit, I'd create the SM inline via System.setSecurityManager(new 
SecurityManager() { so it's clear it's only associated with this DNS test 
(like TestDFSShell, for example).
- Nit, rename lookups in the test to initialLookups.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478350#comment-13478350
 ] 

Daryn Sharp commented on HDFS-3990:
---

I'm making the changes, but I found that I cannot remove the null check for 
dnAddress in the nodemanager.  Tests using the minicluster directly get the rpc 
server (the remote/internal one of the NN) so no rpc socket connection is 
formed.

I also don't think I can inline the SecurityManager (I initially tried) 
otherwise I cannot get access to the count of lookups.  Java won't recognize 
that the field is available, or let me call a getter method because it's not 
defined by the SM.
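
The Java restriction described above can be illustrated with a small sketch.
It is not code from the patch: ConnectHook is a hypothetical stand-in for
java.lang.SecurityManager (used so the example runs on any JDK), and
CountingHook, inlineHook and LookupCountSketch are made-up names. It shows a
named subclass exposing its lookup count, plus one possible way to keep the
hook inline by counting into a captured AtomicInteger.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for java.lang.SecurityManager (not the real class). */
abstract class ConnectHook {
    /** Same shape as SecurityManager#checkConnect(String, int). */
    abstract void checkConnect(String host, int port);
}

class LookupCountSketch {

    /** A named subclass can declare extra members that callers can reach. */
    static class CountingHook extends ConnectHook {
        private int lookups;

        @Override
        void checkConnect(String host, int port) {
            // SecurityManager#checkConnect receives port -1 for a name lookup.
            if (port == -1) {
                lookups++;
            }
        }

        int getLookups() {
            return lookups;
        }
    }

    /** An inline hook cannot expose new members, but it can update captured state. */
    static ConnectHook inlineHook(final AtomicInteger counter) {
        return new ConnectHook() {
            @Override
            void checkConnect(String host, int port) {
                if (port == -1) {
                    counter.incrementAndGet();
                }
            }
        };
    }

    public static void main(String[] args) {
        CountingHook named = new CountingHook();
        named.checkConnect("dn1.example.com", -1);    // name lookup, counted
        named.checkConnect("dn1.example.com", 50010); // plain connect, ignored
        System.out.println("named hook lookups: " + named.getLookups());

        AtomicInteger counter = new AtomicInteger();
        ConnectHook inline = inlineHook(counter);
        inline.checkConnect("dn2.example.com", -1);
        System.out.println("inline hook lookups: " + counter.get());
    }
}
```

A variable typed as the base class only sees the base class's members, which
is why a getter added in an anonymous subclass is unreachable from the test.
Whether the captured-counter variant is preferable to a named class is a
style choice.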


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478492#comment-13478492
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549589/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3358//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3358//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478557#comment-13478557
 ] 

Eli Collins commented on HDFS-3990:
---

Two small comments, +1 otherwise!

- Let's remove this line and leave peerHostName as null since we're claiming 
the peerHostName is the hostname from the actual connection. It's also useful 
to have something to check to indicate the peerHostName has not been 
determined.  
{code}
this.peerHostName = hostName; // will assume it's the given host for now
{code}
- Move the "// Update the IP to the address of the RPC request..." comment up 
with the setIpAddr call.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478650#comment-13478650
]

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment

http://issues.apache.org/jira/secure/attachment/12549614/HDFS-3990.branch-0.23.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3360//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477053#comment-13477053
]

Daryn Sharp commented on HDFS-3990:
---

bq. Maintaining both an ipAddr/hostName plus nodeAddr with the same
information, which can become inconsistent is error prone. For example what do
you do when the ipAddr and the nodeAddr disagree?

They should never disagree because the nodeAddr is based on the ipAddr, and
when the nodeAddr is changed, so is the ipAddr.

bq. The ipAddr field for a DataNode ID should never change because it (and the
xferPort) are the unique key for a DataNode.

They will change when a pre-existing node, say one with the same storage id, is
updated with the new info.

bq. We also now have to worry about the state where we're both resolved and
unresolved.

We need to worry about that case just like the code did before. Let's say the
exclude list has hostnames. A node registration occurs but there's a dns
hiccup so all we have is its ip. Your proposed patch may let the node in
whereas the existing code (and my patch) will block the node.

bq. What do you think of the attached patch? It sets the DatanodeID hostname
field at registration time (like the IP addr) ...

The patch appears to change the way the include and exclude work by trusting
who the datanode claims to be. What if a datanode lies about who it is? Or
if a dns hiccup occurs when the datanode is going to register? It sends its
name as an ip, but the exclude list only has hosts. There are a number of
scenarios where a datanode could bypass the include/exclude list, which is why
we should never trust the client.

bq. ... using the same lookup we do today and replaces the two problematic
lookups with uses of this field.

Unless I've overlooked something, there's only one lookup that occurs?

I'll post a minor rev for consideration that should further ensure the fields
never go out of sync.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477131#comment-13477131
]

Eli Collins commented on HDFS-3990:
---

bq. They will change when a pre-existing node, say one with the same storage
id, is updated with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually
works today.

bq. The patch appears to change the way the include and exclude work by
trusting who the datanode claims to be. What if a datanode lies about who it
is? Or if a dns hiccup occurs when the datanode is going to register? It sends
its name as an ip, but the exclude list only has hosts. There are a number of
scenarios where a datanode could bypass the include/exclude list, which is why
we should never trust the client.

Take another look at the patch: the NN is doing the lookup, not the DN, just at
registration time. How about we reject the DN registration in case of a DNS
hiccup (rather than use the DN value which the patch currently does in this
case)? The DN will retry until it succeeds. When working on HDFS-3171 I
considered removing the ability for the DN to override the hostname, and have
just one lookup per DN (ie currently both the NN and DN resolve the DN
hostname). We could open a separate jira for that, might be easier to layer
this one atop it.

I'm against having DatanodeID fields that duplicate the other fields since I
think we can solve the problem here and avoid doing so. My experience from
HDFS-3144 indicates we will introduce bugs and it's hard to correctly untangle
later.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477181#comment-13477181
]

Daryn Sharp commented on HDFS-3990:
---

bq. I'm not sure re-registering with a new IP and the same storage ID actually
works today.

Jason Lowe recently finished a jira to make that work.

bq. How about we reject the DN registration in case of a DNS hiccup (rather
than use the DN value which the patch currently does in this case)?

I think I'm fine with that, so long as we are more strictly ruling out the
ability to run a cluster in a dns-less or dns error-tolerant environment. I
was considering a second jira that would first scan the include/exclude for the
ip, and if not found, would return include=false or exclude=true if the ip is
unresolved instead of flat out rejecting the node.

Ignoring the name the dn declares is a trivial enough change; do you think
we can just do it in this patch? I was trying to avoid any functional change
with this patch (because who knows what will break!) but I'll post a revised
patch that rejects unresolved nodes and ignores the dn's declared name, if
that's ok with you.
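
A minimal sketch of the include/exclude idea discussed above: check the lists
for the ip first, and if the ip is not listed and the NN could not resolve it,
answer include=false / exclude=true. This is not the patch itself.
HostListSketch and its method names are hypothetical, the ip and hostname are
assumed to come from the NN's own view of the connection rather than from what
the DN claims, and treating an empty list as "no restriction" is an assumption
of this example.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class HostListSketch {
    private final Set<String> includes;
    private final Set<String> excludes;

    HostListSketch(Collection<String> includes, Collection<String> excludes) {
        this.includes = new HashSet<>(includes);
        this.excludes = new HashSet<>(excludes);
    }

    /** resolvedHost is null when the NN could not resolve the peer ip. */
    boolean isIncluded(String ip, String resolvedHost) {
        if (includes.isEmpty() || includes.contains(ip)) {
            return true; // assumption: an empty include list admits everyone
        }
        // Unresolved and not listed by ip: cannot prove inclusion.
        return resolvedHost != null && includes.contains(resolvedHost);
    }

    boolean isExcluded(String ip, String resolvedHost) {
        if (excludes.isEmpty()) {
            return false; // assumption: nothing to exclude
        }
        if (excludes.contains(ip)) {
            return true;
        }
        // Unresolved and not listed by ip: cannot prove the node is allowed.
        return resolvedHost == null || excludes.contains(resolvedHost);
    }

    boolean mayRegister(String ip, String resolvedHost) {
        return isIncluded(ip, resolvedHost) && !isExcluded(ip, resolvedHost);
    }
}
```

With this shape, a node that is listed only by hostname is turned away while
dns is failing, and is admitted on a later retry once the lookup succeeds.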


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477296#comment-13477296
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549353/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
  org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
  org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement
  org.apache.hadoop.hdfs.TestMiniDFSCluster
  org.apache.hadoop.cli.TestHDFSCLI
  org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
  org.apache.hadoop.hdfs.TestReplication

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3347//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3347//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3347//console

This message is automatically generated.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477373#comment-13477373
]

Daryn Sharp commented on HDFS-3990:
---

Ignoring the hostname the datanode claims to be is blowing up tests that are
checking rack placement. Those tests need to use spoofed hostnames for the
rack mapping.

Prior to the patch, only the include/exclude lists checked the real hostname.
Using the datanode's claimed hostname for the include/exclude checks creates a
security issue, and ignoring the claimed hostname causes tests to fail. I was
fearful that any functional change would break the code, so I'll toss up
another variant of the original patch that keeps the two names separate.

We really need this dns fix, so I think we'll need to break the unified and
proper handling of the dn hostnames out into another jira. Agree?


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477409#comment-13477409
 ] 

Eli Collins commented on HDFS-3990:
---

Yea, that's what I meant above by "This breaks dfs.datanode.hostname but this 
config is only used by the tests and we can fix those up." How about I fix up 
my previous patch to unconditionally set the hostname and fix the tests? 

Btw the latest patch has some changes like changing DatanodeID#hashCode to 
ignore the IP addr, I don't think that's correct.





[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477415#comment-13477415
]

Daryn Sharp commented on HDFS-3990:
---

bq. Btw the latest patch has some changes like changing DatanodeID#hashCode to
ignore the IP addr, I don't think that's correct.

The ip is mutable, so it can't be part of the {{hashCode}}. When a datanode
registers with an existing storage id, the ip port will be updated which will
affect a node in a collection. The storage id is immutable and unique so
basing the {{hashCode}} off of it should be sufficient?
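
The concern about a mutable field in {{hashCode}} can be reproduced with a
small, self-contained sketch. NodeId here is a hypothetical class, not the
real DatanodeID; the hashByIp flag exists only so both behaviours can be
compared side by side.

```java
import java.util.HashSet;
import java.util.Set;

class HashCodeSketch {

    /** Hypothetical id: storageId never changes, ipAddr may be updated later. */
    static class NodeId {
        final String storageId;
        String ipAddr;
        final boolean hashByIp;

        NodeId(String storageId, String ipAddr, boolean hashByIp) {
            this.storageId = storageId;
            this.ipAddr = ipAddr;
            this.hashByIp = hashByIp;
        }

        @Override
        public int hashCode() {
            // Hashing a mutable field changes the hash once the field changes.
            return hashByIp ? ipAddr.hashCode() : storageId.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof NodeId && storageId.equals(((NodeId) o).storageId);
        }
    }

    /** Adds a node to a HashSet, changes its ip, then looks it up again. */
    static boolean foundAfterIpChange(boolean hashByIp) {
        Set<NodeId> nodes = new HashSet<>();
        NodeId node = new NodeId("DS-1", "10.0.0.1", hashByIp);
        nodes.add(node);
        node.ipAddr = "10.0.0.2"; // simulates an update on re-registration
        return nodes.contains(node);
    }

    public static void main(String[] args) {
        System.out.println("hash by ip:         " + foundAfterIpChange(true));
        System.out.println("hash by storage id: " + foundAfterIpChange(false));
    }
}
```

When the hash is derived from the ip, the set can no longer find the node
after the update, even though the very same object is still stored in it. The
lookup keeps working when the hash is derived from the immutable storage id.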


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477420#comment-13477420
 ] 

Daryn Sharp commented on HDFS-3990:
---

I forgot to mention that I think you'll find fixing/updating the rack placement 
tests will be exceedingly difficult w/o doing something very hacky.  Everything 
looks like localhost to a minicluster.  At least Kihwal and I couldn't find a 
clean way to update the tests...


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477620#comment-13477620
 ] 

Eli Collins commented on HDFS-3990:
---

Pulled cleanup out to HDFS-4068.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476139#comment-13476139
]

Daryn Sharp commented on HDFS-3990:
---

The caching is to prevent the unnecessary dns lookups that are a multiple of
the number of datanodes - typically just to view a jsp or query json, or for
other internal operations as well. Every time a node is checked against the
include/exclude lists, it generates dns queries of 2X the datanodes. Counting
the number of nodes causes a dns query for every datanode.

Reassigning an ip should require no restart of the NN. The DNs are tracked by
their ip and storage id. If a DN registers with a previously known ip or
storage id, the existing node is updated with the fields in the new node id
which contain a refreshed lookup.
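
As a rough illustration of the caching described above, the following sketch
resolves a node's hostname once at registration and reuses the cached value
when building a report. It makes several assumptions: RegistrationCacheSketch
and its members are hypothetical names, nodes are keyed by storage id only
(the real NN also tracks them by ip), and the resolver is passed in as a
function so the example does not touch real dns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class RegistrationCacheSketch {

    /** Cached view of one node: the ip plus the hostname resolved at registration. */
    private static final class Node {
        final String ip;
        final String host;

        Node(String ip, String host) {
            this.ip = ip;
            this.host = host;
        }
    }

    private final Function<String, String> resolver; // ip -> hostname
    private final Map<String, Node> nodesByStorageId = new LinkedHashMap<>();
    private int lookups;

    RegistrationCacheSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** The only place a lookup happens; re-registration refreshes the entry. */
    void register(String storageId, String ip) {
        lookups++;
        nodesByStorageId.put(storageId, new Node(ip, resolver.apply(ip)));
    }

    /** Builds the report purely from cached fields, with no lookups. */
    List<String> healthReport() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Node> e : nodesByStorageId.entrySet()) {
            Node n = e.getValue();
            lines.add(e.getKey() + " " + n.host + " (" + n.ip + ")");
        }
        return lines;
    }

    int getLookups() {
        return lookups;
    }
}
```

In this shape the number of lookups grows with the number of registrations,
not with the number of times the report is rendered, so a report loop running
under a lock would not wait on dns.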


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems


[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476320#comment-13476320
 ] 

Daryn Sharp commented on HDFS-3990:
---

Pre-commit build is clean, but it failed to connect to jira:
https://builds.apache.org/job/PreCommit-HDFS-Build/3331/consoleText


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Eli Collins (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476429#comment-13476429
]

Eli Collins commented on HDFS-3990:
---

Why not use the DatanodeID hostName field instead of calling and caching
InetAddress#getByName in the NN? The DN has already done the lookup (modulo the
tests which use dfs.datanode.hostname) and this way we don't have to worry
about inconsistency between the nodeAddr and the ipAddr/hostName fields. For
sanity the NN could do a lookup when the DN registers and compare its value to
the DN reported one.

Comments on this patch:
- In registerDatanode why is it OK to no longer update the registration info
with the reported IP?
- The comments in DatanodeManager ("Mostly called inside an RPC..." and "Update
the IP to the address of the RPC request...") are no longer accurate after your
change.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

[
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476446#comment-13476446
]

Daryn Sharp commented on HDFS-3990:
---

As best I can tell, the {{DatanodeID}}'s hostname is what the DN claims to be
in the registration. The existing include/exclude list checks use the DN's ip
and real hostname, not the one the node claimed to be in the registration.
I'm trying to preserve existing behavior by just caching the socket's peer name
at registration, so that resolved socket addr can be reused when checking the
include/exclude lists.

bq. In registerDatanode why is OK to no longer update the registration info
with the reported IP?

The ip actually is updated when {{setNodeAddr}} is called with the socket's
peer.

My bad on the comments. I'm not sure how I lost that change.

I know the approach isn't perfect, and many of the fields could likely be
folded together into the socket addr, but I'm trying to make the minimalist
change to avoid a slew of dns queries that are having an adverse performance
impact on multi-thousand node clusters.


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems