[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2013-04-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632527#comment-13632527
 ] 

Chris Nauroth commented on HDFS-3990:
-

I filed HDFS-4702 to investigate removing the namesystem lock from this code 
path.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.5

 Attachments: HDFS-3990.branch-0.23.patch, 
 HDFS-3990.branch-0.23.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt


 The dfshealth page will place a read lock on the namespace while it does a 
 dns lookup for every DN.  On a multi-thousand node cluster, this often 
 results in 10s+ load time for the health page.  10 concurrent requests were 
 found to cause 7m+ load times during which time write operations blocked.
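
One general way to keep slow lookups from holding up writers is to copy what the page needs while holding the read lock and do the DNS work only after releasing it. The sketch below illustrates that pattern with a plain ReentrantReadWriteLock. The class HealthReportSketch and its methods are hypothetical names for illustration; this is not the NameNode code and not necessarily what the attached patches do.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration; these are not the NameNode's actual classes.
class HealthReportSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> datanodeIps = new ArrayList<>();

  // Writers take the write lock, so they wait for every active reader.
  void addDatanode(String ip) {
    lock.writeLock().lock();
    try {
      datanodeIps.add(ip);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Copy the addresses while holding the read lock; no network I/O happens here.
  List<String> snapshotIps() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(datanodeIps);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Resolve names only after the lock is released, so slow DNS cannot block writers.
  List<String> report() {
    List<String> lines = new ArrayList<>();
    for (String ip : snapshotIps()) {
      String host;
      try {
        host = InetAddress.getByName(ip).getHostName();
      } catch (UnknownHostException e) {
        host = ip; // fall back to the literal address
      }
      lines.add(ip + " -> " + host);
    }
    return lines;
  }

  public static void main(String[] args) {
    HealthReportSketch sketch = new HealthReportSketch();
    sketch.addDatanode("127.0.0.1");
    for (String line : sketch.report()) {
      System.out.println(line);
    }
  }
}
```

The trade-off is that the report is built from a snapshot, so it can be slightly stale by the time the lookups finish.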

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2013-04-14 Thread Jagane Sundar (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631357#comment-13631357
 ] 

Jagane Sundar commented on HDFS-3990:
-

Here is another instance of this problem.

Here is what I am trying to do:

Create a single VM developer environment that runs all daemons in a VM. The VM 
gets a DHCP IP address, but there is no hostname associated with the IP 
address. I configure hadoop using the DHCP IP address (e.g. 192.168.1.94) 
instead of the hostname, or 'localhost' or '127.0.0.1'. Datanode registration 
fails because of this check.

HDFS-4269 creates an escape hatch just for 127.0.0.1. That does not solve my 
problem because I want to use the DHCP address 192.168.1.94. I want to use 
192.168.1.94 because I want to be able to access this VM from my host OS, or 
from other machines in the network (if I use bridged networking in the virtual 
NIC configuration).

I don't quite follow the original reasoning behind this check. Is there some 
fundamental reason why HDFS cannot operate in an environment where the IP 
address of the host cannot be resolved to a hostname?



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2013-04-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631477#comment-13631477
 ] 

Chris Nauroth commented on HDFS-3990:
-

Hello, [~jagane].  I noticed that you commented on both this and HDFS-4269.  
I'm going to focus the response on HDFS-4269, so please see my comment there.  
Thanks!



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2013-01-17 Thread liang xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556060#comment-13556060
 ] 

liang xie commented on HDFS-3990:
-

We hit the same issue as [~cnauroth] on Linux with a modified CDH4.1.1 build; 
the only difference is that the address was 0.0.0.0, not 127.0.0.1.
So I changed the registerDatanode code snippet, based on the final patch, to:
{code}
  if (hostname.equals(ip)) {
    try {
      hostname = InetAddress.getByName(Server.getRemoteAddress()).
          getHostName();
    } catch (UnknownHostException e) {
      LOG.warn("Unable to lookup hostname for DataNode " +
          ip + " which registered with hostname " +
          nodeReg.getHostName());
      throw new DisallowedDatanodeException(nodeReg);
    }
  }
{code}

Maybe it will be helpful for somebody else who hits the same issue.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-12-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508869#comment-13508869
 ] 

Chris Nauroth commented on HDFS-3990:
-

Daryn and Eli, we merged this change to branch-trunk-win on Friday, 11/30.  
Unfortunately, this had an unintended side effect of breaking on Windows, at 
least for single-node developer setups, because of the code change to reject 
registration of an unresolved data node:

{code}
  public void registerDatanode(DatanodeRegistration nodeReg)
      throws DisallowedDatanodeException {
    InetAddress dnAddress = Server.getRemoteIp();
    if (dnAddress != null) {
      // Mostly called inside an RPC, update ip and peer hostname
      String hostname = dnAddress.getHostName();
      String ip = dnAddress.getHostAddress();
      if (hostname.equals(ip)) {
        LOG.warn("Unresolved datanode registration from " + ip);
        throw new DisallowedDatanodeException(nodeReg);
      }
{code}

On Windows, 127.0.0.1 does not resolve to localhost.  It reports host name as 
127.0.0.1.  Therefore, on Windows, running pseudo-distributed mode or 
MiniDFSCluster-based tests always rejects the datanode registrations.  (See 
HADOOP-8414 for more discussion of the particulars of resolving 127.0.0.1 on 
Windows.)

Potential fixes I can think of:

# Add special case logic to allow registration if ip.equals("127.0.0.1").  This 
is the quick fix I applied to my dev environment to unblock myself last Friday.
# Add a check against NetUtils.getStaticResolution and register it with 
NetUtils.addStaticResolution("127.0.0.1", "localhost") somewhere at 
initialization time.

Do you have an opinion on the best way to fix it?  I have a Windows VM ready to 
go, so I can code the patch and test.
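
For reference, the first option could look roughly like the sketch below. It uses InetAddress.isLoopbackAddress() rather than a string comparison against "127.0.0.1", which is a slightly broader variant of the quick fix described above. RegistrationCheckSketch and rejectAsUnresolved are hypothetical names, not existing Hadoop code.

```java
import java.net.InetAddress;

// Hypothetical helper; the name and placement are assumptions, not Hadoop code.
class RegistrationCheckSketch {
  // Pure decision: reject only when the name did not resolve and the peer is
  // not a loopback address.
  static boolean rejectAsUnresolved(String hostname, String ip, boolean loopback) {
    boolean unresolved = hostname.equals(ip);
    return unresolved && !loopback;
  }

  // Convenience overload that takes the values from an InetAddress.
  static boolean rejectAsUnresolved(InetAddress dnAddress) {
    return rejectAsUnresolved(
        dnAddress.getHostName(),
        dnAddress.getHostAddress(),
        dnAddress.isLoopbackAddress());
  }

  public static void main(String[] args) {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    System.out.println("reject loopback? " + rejectAsUnresolved(loopback));
  }
}
```

With this check, a non-loopback address that fails reverse lookup is still rejected, so the special case stays limited to single-node setups.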




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-12-03 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508879#comment-13508879
 ] 

Daryn Sharp commented on HDFS-3990:
---

The check was floated up out of {{DatanodeManager.checkInList}} which rejected 
unresolvable nodes.  Is it that {{InetAddress.getByName}} on windows doesn't 
resolve 127.0.0.1 and doesn't throw {{UnknownHostException}}, which makes it 
appear it didn't resolve?  I seem to have vague recollection of a similar issue 
before.





[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-12-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508889#comment-13508889
 ] 

Chris Nauroth commented on HDFS-3990:
-

The problem I observe is that a server accepts a client socket connection, gets 
the connection's InetAddress, and then getHostName returns 127.0.0.1.  Below 
is a short code sample that demonstrates the problem.  This is a very rough 
approximation of the IPC Server/Connection and DatanodeManager logic.  When I 
run this server on Mac, it prints "connection from hostName = localhost, 
hostAddress = 127.0.0.1, canonicalHostName = localhost" for any client 
connection.  On Windows, it prints "connection from hostName = 127.0.0.1, 
hostAddress = 127.0.0.1, canonicalHostName = 127.0.0.1".

{code}
package cnauroth;

import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.channels.ServerSocketChannel;

class Main {
  public static void main(String[] args) throws Exception {
    ServerSocket ss = ServerSocketChannel.open().socket();
    ss.bind(new InetSocketAddress("localhost", 1234), 0);
    System.out.println("ss = " + ss);
    for (;;) {
      Socket s = ss.accept();
      InetAddress addr = s.getInetAddress();
      System.out.println("connection from hostName = " + addr.getHostName() +
          ", hostAddress = " + addr.getHostAddress() + ", canonicalHostName = " +
          addr.getCanonicalHostName());
      PrintWriter pw = new PrintWriter(s.getOutputStream());
      pw.println("hello");
      pw.close();
      s.close();
    }
  }
}
{code}
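
The same lookup can probably be checked without a socket, since an InetAddress built from a literal IP also has no host name attached until getHostName performs the reverse lookup. The sketch below is a smaller variant under that assumption; ReverseLookupSketch and describe are hypothetical names, and the printed host name depends on the platform and resolver configuration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical, socket-free variant of the check above.
class ReverseLookupSketch {
  // Builds the same three-field summary the server sample prints.
  static String describe(String literalIp) {
    try {
      InetAddress addr = InetAddress.getByName(literalIp);
      return "hostName = " + addr.getHostName()
          + ", hostAddress = " + addr.getHostAddress()
          + ", canonicalHostName = " + addr.getCanonicalHostName();
    } catch (UnknownHostException e) {
      return "unparseable address: " + literalIp;
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("127.0.0.1"));
  }
}
```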




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493904#comment-13493904
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Yarn-trunk #31 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/31/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java


 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.5

 Attachments: HDFS-3990.branch-0.23.patch, 
 HDFS-3990.branch-0.23.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt


 The dfshealth page will place a read lock on the namespace while it does a 
 dns lookup for every DN.  On a multi-thousand node cluster, this often 
 results in 10s+ load time for the health page.  10 concurrent requests were 
 found to cause 7m+ load times during which time write operations blocked.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493973#comment-13493973
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Hdfs-0.23-Build #430 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/430/])
HDFS-3990. NN's health report has severe performance problems (daryn) 
(Revision 1407336)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407336
Files : 
* 
/hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* 
/hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* 
/hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493985#comment-13493985
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Hdfs-trunk #1221 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1221/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494011#comment-13494011
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-Mapreduce-trunk #1251 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1251/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = FAILURE
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-11-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493651#comment-13493651
 ] 

Hudson commented on HDFS-3990:
--

Integrated in Hadoop-trunk-Commit #2985 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2985/])
HDFS-3990.  NN's health report has severe performance problems (daryn) 
(Revision 1407333)

 Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487135#comment-13487135
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12551382/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3428//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3428//console

This message is automatically generated.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.branch-0.23.patch, 
 HDFS-3990.branch-0.23.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt


 The dfshealth page will place a read lock on the namespace while it does a 
 dns lookup for every DN.  On a multi-thousand node cluster, this often 
 results in 10s+ load time for the health page.  10 concurrent requests were 
 found to cause 7m+ load times during which time write operations blocked.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-30 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487247#comment-13487247
 ] 

Eli Collins commented on HDFS-3990:
---

I missed that you switched to a List because we're conditionally adding items, 
which makes an ImmutableList hard to use.  I think using a List is better than 
the latest patch, where you convert the List to an array, so +1 to the Oct 22nd 
patch.
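
To illustrate the trade-off being discussed: when entries are added conditionally, the list has to be built up step by step, and it can still be handed out read-only at the end. The sketch below uses only the JDK (Collections.unmodifiableList); Guava's ImmutableList.Builder would be the closer analogue to the earlier suggestion. The class, the method namesFor and its parameters are hypothetical stand-ins, not the real getNodeNamesForHostFiltering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the method under review; parameters are assumptions.
class HostFilterNamesSketch {
  static List<String> namesFor(String ip, String hostname, int port) {
    List<String> names = new ArrayList<>();
    names.add(ip);
    names.add(ip + ":" + port);
    // Conditional additions are what make a fixed-size array awkward here.
    if (hostname != null && !hostname.equals(ip)) {
      names.add(hostname);
      names.add(hostname + ":" + port);
    }
    // Callers get a read-only view, so they cannot modify the result.
    return Collections.unmodifiableList(names);
  }

  public static void main(String[] args) {
    System.out.println(namesFor("10.0.0.5", "dn1.example.com", 50010));
    System.out.println(namesFor("10.0.0.5", null, 50010));
  }
}
```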



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-29 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486279#comment-13486279
 ] 

Eli Collins commented on HDFS-3990:
---

Now that you're switching getNodeNamesForHostFiltering from using an array to a 
List, I'd use an ImmutableList.  +1 otherwise

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.branch-0.23.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt


 The dfshealth page will place a read lock on the namespace while it does a 
 dns lookup for every DN.  On a multi-thousand node cluster, this often 
 results in 10s+ load time for the health page.  10 concurrent requests were 
 found to cause 7m+ load times during which time write operations blocked.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481804#comment-13481804
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12550340/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3378//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3378//console

This message is automatically generated.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-18 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479000#comment-13479000
 ] 

Daryn Sharp commented on HDFS-3990:
---

bq. Let's remove this line and leave peerHostName as null since we're claiming 
the peerHostname is the hostname from the actual connection. It's also useful 
to have something to check to indicate the peerHostName has not been determined.

The known case where the {{peerHostName}} will not be set is when the 
minicluster tests directly register a dn.  If the assignment is removed, then 
I'm not sure where the null check should go or what it should do.  It could 
either be in {{DatanodeID#getPeerHostName}}, which would return the 
{{hostName}} field, or the getter could return null and 
{{DatanodeManager#getNodeNamesForHostFiltering}} would omit the 
{{peerHostName}} when it is null.  I'm a bit concerned that tests, such as 
include/exclude list checks, might again break...  Or I could update the 
comment to indicate it's either the remote RPC host or the dn reg's hostname?
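
One of the options discussed above (leave the peer hostname null and skip it when building the filter names) could look roughly like the following. This is only a sketch: the class is a simplified, hypothetical stand-in for DatanodeID, and the constructor and field layout are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for DatanodeID; not the real Hadoop class.
final class PeerNameSketch {
  final String ipAddr;
  final String hostName;      // name the datanode reported at registration
  final String peerHostName;  // name seen on the RPC connection; null if never set

  PeerNameSketch(String ipAddr, String hostName, String peerHostName) {
    this.ipAddr = ipAddr;
    this.hostName = hostName;
    this.peerHostName = peerHostName;
  }

  // Names to compare against the include/exclude lists.
  // A null peerHostName is simply left out, so it can never match.
  List<String> getNodeNamesForHostFiltering() {
    List<String> names = new ArrayList<String>();
    names.add(ipAddr);
    names.add(hostName);
    if (peerHostName != null) {
      names.add(peerHostName);
    }
    return names;
  }

  public static void main(String[] args) {
    // Direct registration (as in a minicluster test): no peer hostname.
    System.out.println(
        new PeerNameSketch("127.0.0.1", "localhost", null).getNodeNamesForHostFiltering());
    // Registration over RPC: the peer hostname is known.
    System.out.println(
        new PeerNameSketch("10.0.0.5", "dn1", "dn1.example.com").getNodeNamesForHostFiltering());
  }
}
```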



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-18 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479207#comment-13479207
 ] 

Eli Collins commented on HDFS-3990:
---

A null peerHostName just means that field doesn't match. Since we also check 
the hostName reported by the DataNode, which the mini cluster explicitly sets, 
we should be good; that's the current behavior after all, right? I.e. the only 
thing we're adding here is an additional hostname field to check, which is null 
in the tests and so won't be checked. Related: it would be good to make the 
minicluster match real cluster behavior here.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477909#comment-13477909
 ] 

Daryn Sharp commented on HDFS-3990:
---

In your patch, it's not necessary for the NN to do another lookup of the DN's 
hostname.  It's already available in the {{InetAddress}} returned by 
{{Server.getRemoteIp()}}.  Passing this {{InetAddress}} to {{updateNodeAddr}}, 
rather than individually updating the hostname and ip, ensures the host and ip 
are always updated in tandem, which avoids your concern about the fields going 
out of sync.
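
A rough sketch of the "update in tandem" idea follows. The class is a hypothetical stand-in (only the method name updateNodeAddr comes from the comment above), and the example address is built with InetAddress.getByAddress(host, bytes), which does not consult a name service, so the sketch runs without DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-in for the node record; not the real DatanodeID.
final class NodeAddrSketch {
  private String ipAddr;
  private String peerHostName;

  // Single entry point: both fields are always taken from the same
  // InetAddress, so they cannot be updated independently.
  void updateNodeAddr(InetAddress addr) {
    this.ipAddr = addr.getHostAddress();
    this.peerHostName = addr.getHostName();
  }

  String getIpAddr() { return ipAddr; }
  String getPeerHostName() { return peerHostName; }

  public static void main(String[] args) throws UnknownHostException {
    // Built from a literal host and address; no lookup is performed here.
    InetAddress remote =
        InetAddress.getByAddress("dn1.example.com", new byte[] {10, 0, 0, 5});
    NodeAddrSketch node = new NodeAddrSketch();
    node.updateNodeAddr(remote);
    System.out.println(node.getIpAddr() + " " + node.getPeerHostName());
  }
}
```

Because the InetAddress already carries the hostname it was created with, getHostName() returns it without a further reverse lookup in this example.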

If we do change the datanode manager to ignore the hostname in the node 
registration, do you think it's possible to update all the tests that check 
rack placement?  I'm not sure how we can do that in a timely manner, so would 
you be willing to have a separate jira for that functional change to expedite 
this compatible one?



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477975#comment-13477975
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549515/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.balancer.TestBalancer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3355//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3355//console

This message is automatically generated.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478000#comment-13478000
 ] 

Daryn Sharp commented on HDFS-3990:
---

The balancer test seems to randomly fail.  It passes for me.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478271#comment-13478271
 ] 

Eli Collins commented on HDFS-3990:
---

Think the approach in the latest patch should work. Once HDFS-4068 is committed 
you can rebase on it and remove all the cleanup.

Comments:
- We can remove the dnAddress check for null now that it looks like 
NNThroughputBenchmark always uses RPC 
- Rename getNodeNames to something more explicit like getNodeNamesForHostFiltering?
- Rather than have updateNodeAddr let's use the two setters explicitly, easier 
to follow the registration behavior (ie we explicitly clobber the ip and peer 
hostname). Hopefully we'll eventually be able to make DatanodeID immutable so 
we don't update it in place.
- Let's update getNodeNames to include the DN hostname since that is the 
current behavior, and file a separate jira for removing the use of the DN 
reported hostname here (or perhaps removing the reported DN field entirely)
- Let's update hashCode in a separate change. I think this will need some 
additional changes like modifying Host2NodesMap to use DataNodeID hashCode, it 
currently explicitly uses the IP addr for the hash and ignores 
DatanodeID#hashCode.
- Add a javadoc to testDNSLookups indicating that we're testing that the NN 
does *not* do DN lookups after registration 
- Nit, I'd create the SM inline via System.setSecurityManager(new 
SecurityManager() { so it's clear it's only associated with this DNS test 
(like TestDFSShell, for example)
- Nit, rename lookups in the test to initialLookups



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478350#comment-13478350
 ] 

Daryn Sharp commented on HDFS-3990:
---

I'm making the changes, but I found that I cannot remove the null check for 
dnAddress in the nodemanager.  Tests using the minicluster directly get the rpc 
server (the remote/internal one of the NN) so no rpc socket connection is 
formed.

I also don't think I can inline the SecurityManager (I initially tried) 
otherwise I cannot get access to the count of lookups.  Java won't recognize 
that the field is available, or let me call a getter method because it's not 
defined by the SM.
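
The language limitation described above can be shown with a small sketch. It deliberately uses a hypothetical LookupHook base class instead of java.lang.SecurityManager (which is deprecated in recent JDKs); all class and method names here are assumptions, apart from the checkConnect convention that a port of -1 denotes a name resolution.

```java
// Sketch of why a named subclass is needed to read the lookup counter.
final class AnonymousVsNamed {
  // Hypothetical stand-in for java.lang.SecurityManager.
  static class LookupHook {
    void checkConnect(String host, int port) { }
  }

  // Named subclass: the counter is reachable through its own type.
  static final class CountingHook extends LookupHook {
    private int lookups = 0;

    @Override
    void checkConnect(String host, int port) {
      if (port == -1) {  // -1 is the convention for "resolving a host name"
        lookups++;
      }
    }

    int getLookups() { return lookups; }
  }

  public static void main(String[] args) {
    LookupHook anon = new LookupHook() {
      int lookups = 0;

      @Override
      void checkConnect(String host, int port) { lookups++; }
    };
    anon.checkConnect("dn1.example.com", -1);
    // anon.lookups would not compile: the static type LookupHook
    // declares no such field, and the anonymous type has no name to cast to.

    CountingHook named = new CountingHook();
    named.checkConnect("dn1.example.com", -1);
    named.checkConnect("dn1.example.com", 50010);
    System.out.println(named.getLookups());
  }
}
```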



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478492#comment-13478492
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549589/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3358//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3358//console

This message is automatically generated.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478557#comment-13478557
 ] 

Eli Collins commented on HDFS-3990:
---

Two small comments, +1 otherwise!

- Let's remove this line and leave peerHostName as null since we're claiming 
the peerHostname is the hostname from the actual connection. It's also useful 
to have something to check to indicate the peerHostName has not been 
determined.  
{code}this.peerHostName = hostName; // will assume it's the given host for now{code}
- move the // Update the IP to the address of the RPC request... comment up 
with the setIpAddr call



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478650#comment-13478650
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549614/HDFS-3990.branch-0.23.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3360//console

This message is automatically generated.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477053#comment-13477053
 ] 

Daryn Sharp commented on HDFS-3990:
---

bq. Maintaining both an ipAddr/hostName plus nodeAddr with the same 
information, which can become inconsistent is error prone. For example what do 
you do when the ipAddr and the nodeAddr disagree?

They should never disagree because the nodeAddr is based on the ipAddr, and 
when the nodeAddr is changed, so is the ipAddr.

bq. The ipAddr field for a DataNode ID should never change because it (and the 
xferPort) are the unique key for a DataNode.

They will change when a pre-existing node, say one with the same storage id, is 
updated with the new info.

bq.  We also now have to worry about the state where we're both resolved and 
unresolved.

We need to worry about that case just like the code did before.  Let's say the 
exclude list has hostnames.  A node registration occurs but there's a dns 
hiccup so all we have is its ip.  Your proposed patch may let the node in 
whereas the existing code (and my patch) will block the node.

bq.  What do you think of the attached patch? It sets the DatanodeID hostname 
field at registration time (like the IP addr) ...

The patch appears to change the way the include and exclude work by trusting 
who the datanode claims to be.  What if a datanode lies about who it is?  Or 
if a dns hiccup occurs when the datanode is going to register?  It sends its 
name as an ip, but the exclude list only has hosts.  There are a number of 
scenarios where a datanode could bypass the include/exclude list, which is why 
we should never trust the client.

bq. ... using the same lookup we do today and replaces the two problematic 
lookups with uses of this field.

Unless I've overlooked something, there's only one lookup that occurs?

I'll post a minor rev for consideration that should further ensure the fields 
never go out of sync.
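
The concern about trusting the datanode's claimed name can be made concrete with a small sketch. Everything here is hypothetical (class, method, and host names are made up for illustration); it only contrasts checking the exclude list against the name a node claims with checking it against what the NN observed on the connection.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the include/exclude concern; not Hadoop code.
final class ExcludeCheckSketch {
  // True if any of the given names appears on the exclude list.
  static boolean isExcluded(Set<String> exclude, String... names) {
    for (String name : names) {
      if (name != null && exclude.contains(name)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Set<String> exclude = new HashSet<String>(Arrays.asList("dn-bad.example.com"));

    // What the NN observed on the RPC connection.
    String peerIp = "10.0.0.66";
    String peerHost = "dn-bad.example.com";
    // What the datanode claims in its registration.
    String claimedName = "dn-good.example.com";

    // Trusting the claim lets the excluded node through.
    System.out.println("excluded by claimed name: " + isExcluded(exclude, claimedName));
    // Checking the observed address and name blocks it.
    System.out.println("excluded by observed peer: " + isExcluded(exclude, peerIp, peerHost));
  }
}
```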



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477131#comment-13477131
 ] 

Eli Collins commented on HDFS-3990:
---

bq. They will change when a pre-existing node, say one with the same storage 
id, is updated with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually 
works today.

bq. The patch appears to change the way the include and exclude work by 
trusting who the datanode claims to be. What if a datanode lies about who it 
is? Or if a dns hiccup occurs when the datanode is going to register? It sends 
its name as an ip, but the exclude list only has hosts. There are a number of 
scenarios where a datanode could bypass the include/exclude list, which is why 
we should never trust the client.

Take another look at the patch, the NN is doing the lookup not the DN, just at 
registration time. How about we reject the DN registration in case of a DNS 
hiccup (rather than use the DN value which the patch currently does in this 
case)? The DN will retry until it succeeds.  When working on HDFS-3171 I 
considered removing the ability for the DN to override the hostname, and have 
just one lookup per DN (ie currently both the NN and DN resolve the DN 
hostname). We could open a separate jira for that, might be easier to layer 
this one atop it.

I'm against having DatanodeID fields that duplicate the other fields since I 
think we can solve the problem here and avoid doing so. My experience from 
HDFS-3144 indicates we will introduce bugs and it's hard to correctly untangle 
later.



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477181#comment-13477181
 ] 

Daryn Sharp commented on HDFS-3990:
---

bq. I'm not sure re-registering with a new IP and the same storage ID actually 
works today.

Jason Lowe recently finished a jira to make that work.

bq.  How about we reject the DN registration in case of a DNS hiccup (rather 
than use the DN value which the patch currently does in this case)?

I think I'm fine with that, so long as we are more strictly ruling out the 
ability to run a cluster in a dns-less or dns error-tolerant environment.  I 
was considering a second jira that would first scan the include/exclude for the 
ip, and if not found, would return include=false or exclude=true if the ip is 
unresolved instead of flat out rejecting the node.

Ignoring the name the dn declares is a trivial enough change; do you think we 
can just do it in this patch?  I was trying to avoid any functional change 
with this patch (because who knows what will break!) but I'll post a revised 
patch that rejects unresolved nodes and ignores the dn's declared name, if 
that's ok with you?
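
The follow-up idea mentioned above (scan the lists for the ip first, and treat an unresolved ip as include=false or exclude=true) might be sketched as follows. This is a literal reading of that sentence under stated assumptions: the class and method names are hypothetical, a null resolvedHost stands for "the ip could not be resolved", and handling of empty lists is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed policy; not the committed behavior.
final class UnresolvedHostPolicy {
  // Included if the ip is listed; an unresolved ip that is not listed
  // is treated as not included.
  static boolean isIncluded(Set<String> include, String ip, String resolvedHost) {
    if (include.contains(ip)) {
      return true;
    }
    if (resolvedHost == null) {
      return false;
    }
    return include.contains(resolvedHost);
  }

  // Excluded if the ip is listed; an unresolved ip that is not listed
  // is treated as excluded, since a hostname entry might apply to it.
  static boolean isExcluded(Set<String> exclude, String ip, String resolvedHost) {
    if (exclude.contains(ip)) {
      return true;
    }
    if (resolvedHost == null) {
      return true;
    }
    return exclude.contains(resolvedHost);
  }

  public static void main(String[] args) {
    Set<String> include = new HashSet<String>(Arrays.asList("10.0.0.5", "dn2.example.com"));
    Set<String> exclude = new HashSet<String>(Arrays.asList("dn9.example.com"));

    System.out.println(isIncluded(include, "10.0.0.5", null));              // listed by ip
    System.out.println(isIncluded(include, "10.0.0.6", null));              // unresolved, not listed
    System.out.println(isExcluded(exclude, "10.0.0.6", null));              // unresolved, fail closed
    System.out.println(isExcluded(exclude, "10.0.0.6", "dn2.example.com")); // resolved, not listed
  }
}
```

The design choice here is to fail closed: when the NN cannot tell which hostname an ip belongs to, the node is kept out rather than let in.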





[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477296#comment-13477296
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549353/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
  org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement
  org.apache.hadoop.hdfs.TestMiniDFSCluster
  org.apache.hadoop.cli.TestHDFSCLI
  org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
  org.apache.hadoop.hdfs.TestReplication

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3347//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3347//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3347//console

This message is automatically generated.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477373#comment-13477373
 ] 

Daryn Sharp commented on HDFS-3990:
---

Ignoring the hostname the datanode claims to be is blowing up tests that are 
checking rack placement.  Those tests need to use spoofed hostnames for the 
rack mapping.

Prior to the patch, only the include/exclude lists checked the real hostname.  
Using the datanode's claimed hostname for the include/exclude checks creates a 
security issue, and ignoring the claimed hostname causes tests to fail.  I was 
fearful that any functional change would break the code, so I'll toss up 
another variant of the original patch that keeps the two names separate.

We really need this dns fix, so I think we'll need to break the unified and 
proper handling of the dn hostnames to another jira.  Agree?

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477409#comment-13477409
 ] 

Eli Collins commented on HDFS-3990:
---

Yea, that's what I meant above by "This breaks dfs.datanode.hostname but this 
config is only used by the tests and we can fix those up." How about I fix up 
my previous patch to unconditionally set the hostname and fix the tests?

Btw the latest patch has some changes like changing DatanodeID#hashCode to 
ignore the IP addr, I don't think that's correct.




 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477415#comment-13477415
 ] 

Daryn Sharp commented on HDFS-3990:
---

bq. Btw the latest patch has some changes like changing DatanodeID#hashCode to 
ignore the IP addr, I don't think that's correct.

The ip is mutable, so it can't be part of the {{hashCode}}.  When a datanode 
registers with an existing storage id, the ip and port will be updated, which 
changes the hash of a node that is already in a collection.  The storage id is 
immutable and unique, so basing the {{hashCode}} off of it should be sufficient?
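
A minimal sketch of the concern, for illustration only.  {{NodeId}} is a 
hypothetical stand-in and not the real {{DatanodeID}}; its fields, its 
{{equals}}, and the {{hashIncludesIp}} switch are assumptions made so both 
variants can be compared side by side.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a datanode identifier; NOT the real DatanodeID.
final class NodeId {
    private final String storageId;       // immutable and unique
    private String ipAddr;                // may change when the node re-registers
    private final boolean hashIncludesIp; // toggles the two hashCode variants

    NodeId(String storageId, String ipAddr, boolean hashIncludesIp) {
        this.storageId = storageId;
        this.ipAddr = ipAddr;
        this.hashIncludesIp = hashIncludesIp;
    }

    void setIpAddr(String ipAddr) {
        this.ipAddr = ipAddr;
    }

    @Override
    public boolean equals(Object o) {
        // Assumption for this sketch: identity is defined by the storage id alone.
        return o instanceof NodeId && storageId.equals(((NodeId) o).storageId);
    }

    @Override
    public int hashCode() {
        return hashIncludesIp ? Objects.hash(storageId, ipAddr) : storageId.hashCode();
    }
}

final class HashCodeDemo {
    // Reports whether a node can still be found in a HashSet after its ip changes.
    static boolean foundAfterIpChange(boolean hashIncludesIp) {
        NodeId node = new NodeId("DS-1", "10.0.0.1", hashIncludesIp);
        Set<NodeId> nodes = new HashSet<>();
        nodes.add(node);
        node.setIpAddr("10.0.0.2"); // simulates re-registration with a new ip
        return nodes.contains(node);
    }
}
```

With the ip folded into the hash, the set is searched using the new hash and 
the node stored under the old hash is missed; with the storage id alone, the 
lookup still succeeds.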

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477420#comment-13477420
 ] 

Daryn Sharp commented on HDFS-3990:
---

I forgot to mention that I think you'll find fixing/updating the rack placement 
tests will be exceedingly difficult w/o doing something very hacky.  Everything 
looks like localhost to a minicluster.  At least Kihwal and I couldn't find a 
clean way to update the tests...

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-16 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477620#comment-13477620
 ] 

Eli Collins commented on HDFS-3990:
---

Pulled cleanup out to HDFS-4068.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, 
 HDFS-3990.patch, hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476139#comment-13476139
 ] 

Daryn Sharp commented on HDFS-3990:
---

The caching is to prevent the unnecessary dns lookups that are a multiple of 
the number of datanodes - typically just to view a jsp or query json, or for 
other internal operations as well.  Every time a node is checked against the 
include/exclude lists, it generates dns queries of 2X the datanodes.  Counting 
the number of nodes causes a dns query for every datanode.

Reassigning an ip should require no restart of the NN.  The DNs are tracked by 
their ip and storage id.  If a DN registers with a previously known ip or 
storage id, the existing node is updated with the fields in the new node id, 
which contain a refreshed lookup.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476320#comment-13476320
 ] 

Daryn Sharp commented on HDFS-3990:
---

Pre-commit build is clean, but it failed to connect to jira:
https://builds.apache.org/job/PreCommit-HDFS-Build/3331/consoleText

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476429#comment-13476429
 ] 

Eli Collins commented on HDFS-3990:
---

Why not use the DatanodeID hostName field instead of calling and caching 
InetAddress#getByName in the NN? The DN has already done the lookup (modulo the 
tests which use dfs.datanode.hostname) and this way we don't have to worry 
about inconsistency between the nodeAddr and the ipAddr/hostName fields. For 
sanity the NN could do a lookup when the DN registers and compare its value to 
the DN reported one.

Comments on this patch:
- In registerDatanode why is it OK to no longer update the registration info 
with the reported IP?
- The comments in DatanodeManager ("Mostly called inside an RPC..." and "Update 
the IP to the address of the RPC request..") are no longer accurate after your 
change.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476446#comment-13476446
 ] 

Daryn Sharp commented on HDFS-3990:
---

As best I can tell, the {{DatanodeID}}'s hostname is what the DN claims to be 
in the registration.  The existing include/exclude list checks use the DN's ip 
and real hostname, not the one the node claimed to be in the registration.  
I'm trying to preserve existing behavior by just caching the socket's peer name 
at registration, so that resolved socket addr can be reused when checking the 
include/exclude lists.

bq. In registerDatanode why is OK to no longer update the registration info 
with the reported IP?

The ip actually is updated when {{setNodeAddr}} is called with the socket's 
peer.

My bad on the comments.  I'm not sure how I lost that change.

I know the approach isn't perfect, and many of the fields could likely be 
folded together into the socket addr, but I'm trying to make the minimalist 
change to avoid a slew of dns queries that are having an adverse performance 
impact on multi-thousand node clusters.


 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476461#comment-13476461
 ] 

Daryn Sharp commented on HDFS-3990:
---

I'm also handling the case where a transient dns error may have occurred at the 
time a socket connected.  The patch will attempt another lookup when the 
nodeAddr is requested.
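
A rough sketch of the idea, not the patch itself.  {{CachedNodeAddr}} and its 
injected {{resolver}} are hypothetical names; only {{InetSocketAddress}} and 
its {{isUnresolved()}} method are real JDK API.  In real use the resolver 
could be {{InetSocketAddress::new}}, which performs the lookup.

```java
import java.net.InetSocketAddress;
import java.util.function.BiFunction;

// Hypothetical helper: resolves once up front, then reuses the result.
// A new lookup is attempted only while the cached address is still unresolved.
final class CachedNodeAddr {
    private final String host;
    private final int port;
    private final BiFunction<String, Integer, InetSocketAddress> resolver;
    private InetSocketAddress cached;

    CachedNodeAddr(String host, int port,
                   BiFunction<String, Integer, InetSocketAddress> resolver) {
        this.host = host;
        this.port = port;
        this.resolver = resolver;
        this.cached = resolver.apply(host, port); // single lookup at registration
    }

    synchronized InetSocketAddress get() {
        if (cached.isUnresolved()) {
            // The earlier lookup failed (e.g. a transient dns error), so retry.
            cached = resolver.apply(host, port);
        }
        return cached;
    }
}
```

Once a lookup succeeds, later calls to {{get()}} return the cached address 
without touching the resolver again.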

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476502#comment-13476502
 ] 

Ravi Prakash commented on HDFS-3990:


Thanks for your explanations Daryn! The src/main code looks reasonable to me.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476641#comment-13476641
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12549228/hdfs-3990.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement
  org.apache.hadoop.cli.TestHDFSCLI
  org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
  org.apache.hadoop.hdfs.TestMiniDFSCluster
  org.apache.hadoop.hdfs.TestReplication
  org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3338//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3338//console

This message is automatically generated.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13475370#comment-13475370
 ] 

Hadoop QA commented on HDFS-3990:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12548945/HDFS-3990.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3323//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3323//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3323//console

This message is automatically generated.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-10-12 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13475497#comment-13475497
 ] 

Ravi Prakash commented on HDFS-3990:


I'm sorry I've been out of the loop, but why would caching be the solution? 
If we want to reassign the IP address of a single node's hostname, would it 
require a restart of the NN? Is there a timeout with the caching? Even with a 
timeout I would have my reservations.
Do nodes have hadoop generated unique IDs that we can leverage and match with 
IP addresses that we have cached?

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HDFS-3990.patch




[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-09-28 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465597#comment-13465597
 ] 

Daryn Sharp commented on HDFS-3990:
---

Enabling an nscd host cache helped mitigate the issue by reducing load times to 
a few seconds.  However, the namespace read lock is highly undesirable, and the 
repeated dns lookups are questionable.
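
For illustration only, and not taken from any attached patch: one possible 
way to keep slow lookups from holding a read lock is to copy the node list 
under the lock and do the lookups after releasing it.  {{HealthReport}}, 
{{nodeIps}}, and {{slowLookup}} are hypothetical names, and the lock here is a 
plain {{ReentrantReadWriteLock}} rather than the real namesystem lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical model of a report page that must not block writers on dns.
final class HealthReport {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> nodeIps = new ArrayList<>();

    void addNode(String ip) {
        lock.writeLock().lock();
        try {
            nodeIps.add(ip);
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> describeNodes(Function<String, String> slowLookup) {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(nodeIps); // cheap copy while locked
        } finally {
            lock.readLock().unlock();
        }
        // The slow per-node work runs with no lock held, so writers can proceed.
        List<String> described = new ArrayList<>();
        for (String ip : snapshot) {
            described.add(slowLookup.apply(ip));
        }
        return described;
    }
}
```

The trade-off is that the report reflects the node list as of the snapshot, 
which may be slightly stale by the time the lookups finish.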

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical



[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

2012-09-28 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465598#comment-13465598
 ] 

Daryn Sharp commented on HDFS-3990:
---

Arun, please update the target version if you want to defer the fix to a later 
2.x release.

 NN's health report has severe performance problems
 --

 Key: HDFS-3990
 URL: https://issues.apache.org/jira/browse/HDFS-3990
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
