[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632527#comment-13632527 ]

Chris Nauroth commented on HDFS-3990:

I filed HDFS-4702 to investigate removing the namesystem lock from this code path.

NN's health report has severe performance problems

Key: HDFS-3990
URL: https://issues.apache.org/jira/browse/HDFS-3990
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
Attachments: HDFS-3990.branch-0.23.patch, HDFS-3990.branch-0.23.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt, hdfs-3990.txt

The dfshealth page will place a read lock on the namespace while it does a dns lookup for every DN. On a multi-thousand node cluster, this often results in 10s+ load time for the health page. 10 concurrent requests were found to cause 7m+ load times during which time write operations blocked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
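To illustrate the problem in the issue description (slow lookups performed while a read lock is held), here is a minimal sketch of one general way to shorten the lock hold time: copy the node list under the lock, release it, then do the slow per-node work. This is not the committed patch and not Hadoop code. The class `HealthReportSketch`, its methods, and the `resolver` function standing in for a DNS lookup are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in Hadoop.
class HealthReportSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> datanodeIps = new ArrayList<>();

    // Writers take the write lock, as a namespace mutation would.
    void register(String ip) {
        lock.writeLock().lock();
        try {
            datanodeIps.add(ip);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Builds report lines. The read lock is held only while copying the list;
    // the (potentially slow) resolver runs after the lock is released.
    List<String> report(Function<String, String> resolver) {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(datanodeIps);
        } finally {
            lock.readLock().unlock();
        }
        List<String> lines = new ArrayList<>();
        for (String ip : snapshot) {
            lines.add(ip + " -> " + resolver.apply(ip));
        }
        return lines;
    }

    // Number of read-lock holds by the calling thread (for demonstration).
    int readHolds() {
        return lock.getReadHoldCount();
    }

    public static void main(String[] args) {
        HealthReportSketch nn = new HealthReportSketch();
        nn.register("10.0.0.1");
        nn.register("10.0.0.2");
        // The resolver reports how many read holds exist while it runs.
        for (String line : nn.report(ip -> "host-for-" + ip
                + " (read holds during lookup: " + nn.readHolds() + ")")) {
            System.out.println(line);
        }
    }
}
```

With this shape, a slow resolver delays only the page being rendered; writers waiting on the lock are not blocked for the duration of the lookups.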
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631357#comment-13631357 ]

Jagane Sundar commented on HDFS-3990:

Here is another instance of this problem. Here is what I am trying to do: create a single-VM developer environment that runs all daemons in one VM. The VM gets a DHCP IP address, but there is no hostname associated with the IP address. I configure hadoop using the DHCP IP address (e.g. 192.168.1.94) instead of the hostname, 'localhost', or '127.0.0.1'. Datanode registration fails because of this check.

HDFS-4269 creates an escape hatch just for 127.0.0.1. That does not solve my problem because I want to use the DHCP address 192.168.1.94. I want to use 192.168.1.94 because I want to be able to access this VM from my host OS, or from other machines in the network (if I use bridged networking in the virtual NIC configuration).

I don't quite follow the original reasoning behind this check. Is there some fundamental reason why HDFS cannot operate in an environment where the IP address of the host cannot be resolved to a hostname?
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631477#comment-13631477 ]

Chris Nauroth commented on HDFS-3990:

Hello, [~jagane]. I noticed that you commented on both this and HDFS-4269. I'm going to focus the response on HDFS-4269, so please see my comment there. Thanks!
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556060#comment-13556060 ]

liang xie commented on HDFS-3990:

We hit the same issue as [~cnauroth] on Linux with a modified CDH4.1.1 build; the only difference is that the address is 0.0.0.0, not 127.0.0.1. So I changed the registerDatanode code snippet, based on the final patch:

{code}
if (hostname.equals(ip)) {
  try {
    // Retry the lookup using the remote address of the current RPC call
    hostname = InetAddress.getByName(Server.getRemoteAddress()).getHostName();
  } catch (UnknownHostException e) {
    LOG.warn("Unable to lookup hostname for DataNode " + ip
        + " which registered with hostname " + nodeReg.getHostName());
    throw new DisallowedDatanodeException(nodeReg);
  }
}
{code}

Maybe it is helpful for somebody else who hits the same issue.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508869#comment-13508869 ]

Chris Nauroth commented on HDFS-3990:

Daryn and Eli, we merged this change to branch-trunk-win on Friday, 11/30. Unfortunately, this had an unintended side effect of breaking on Windows, at least for single-node developer setups, because of the code change to reject registration of an unresolved data node:

{code}
public void registerDatanode(DatanodeRegistration nodeReg)
    throws DisallowedDatanodeException {
  InetAddress dnAddress = Server.getRemoteIp();
  if (dnAddress != null) {
    // Mostly called inside an RPC, update ip and peer hostname
    String hostname = dnAddress.getHostName();
    String ip = dnAddress.getHostAddress();
    if (hostname.equals(ip)) {
      LOG.warn("Unresolved datanode registration from " + ip);
      throw new DisallowedDatanodeException(nodeReg);
    }
{code}

On Windows, 127.0.0.1 does not resolve to localhost. It reports the host name as 127.0.0.1. Therefore, on Windows, running pseudo-distributed mode or MiniDFSCluster-based tests always rejects the datanode registrations. (See HADOOP-8414 for more discussion of the particulars of resolving 127.0.0.1 on Windows.)

Potential fixes I can think of:
# Add special case logic to allow registration if ip.equals("127.0.0.1"). This is the quick fix I applied to my dev environment to unblock myself last Friday.
# Add a check against NetUtils.getStaticResolution and register it with NetUtils.addStaticResolution("127.0.0.1", "localhost") somewhere at initialization time.

Do you have an opinion on the best way to fix it? I have a Windows VM ready to go, so I can code the patch and test.
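The first proposed fix (a special case for the loopback address) can be sketched roughly as follows. This is a standalone illustration, not the DatanodeManager code and not the patch that was eventually committed. The class `RegistrationCheckSketch` and the method `shouldAccept` are hypothetical, and using `InetAddress.isLoopbackAddress()` instead of a string comparison against "127.0.0.1" is an assumption made here for illustration.

```java
import java.net.InetAddress;

// Hypothetical sketch; not the Hadoop DatanodeManager code.
class RegistrationCheckSketch {

    // Pure decision function. A peer counts as unresolved when the reported
    // host name is just the textual IP address. Loopback peers are accepted
    // even when unresolved (the special case described in option 1).
    static boolean shouldAccept(String hostName, String ip, boolean loopback) {
        boolean unresolved = hostName.equals(ip);
        return !unresolved || loopback;
    }

    // Convenience overload for the address of a live connection.
    static boolean shouldAccept(InetAddress peer) {
        return shouldAccept(peer.getHostName(), peer.getHostAddress(),
                peer.isLoopbackAddress());
    }

    public static void main(String[] args) throws Exception {
        // A literal IP address is parsed without a name service query.
        InetAddress loopback = InetAddress.getByName("127.0.0.1");
        System.out.println("loopback accepted: " + shouldAccept(loopback));
        System.out.println("unresolved LAN peer accepted: "
                + shouldAccept("192.168.1.94", "192.168.1.94", false));
    }
}
```

Note that a loopback-only exception still rejects an unresolved non-loopback address such as 192.168.1.94.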
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508879#comment-13508879 ]

Daryn Sharp commented on HDFS-3990:

The check was floated up out of {{DatanodeManager.checkInList}}, which rejected unresolvable nodes. Is it that {{InetAddress.getByName}} on Windows doesn't resolve 127.0.0.1 and doesn't throw {{UnknownHostException}}, which makes it appear it didn't resolve? I seem to have a vague recollection of a similar issue before.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508889#comment-13508889 ]

Chris Nauroth commented on HDFS-3990:

The problem I observe is that a server accepts a client socket connection, gets the connection's InetAddress, and then getHostName returns 127.0.0.1. Below is a short code sample that demonstrates the problem. This is a very rough approximation of the IPC Server/Connection and DatanodeManager logic. When I run this server on Mac, it prints "connection from hostName = localhost, hostAddress = 127.0.0.1, canonicalHostName = localhost" for any client connection. On Windows, it prints "connection from hostName = 127.0.0.1, hostAddress = 127.0.0.1, canonicalHostName = 127.0.0.1".

{code}
package cnauroth;

import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.channels.ServerSocketChannel;

class Main {
  public static void main(String[] args) throws Exception {
    ServerSocket ss = ServerSocketChannel.open().socket();
    ss.bind(new InetSocketAddress("localhost", 1234), 0);
    System.out.println("ss = " + ss);
    for (;;) {
      // Accept one client and print how its address resolves
      Socket s = ss.accept();
      InetAddress addr = s.getInetAddress();
      System.out.println("connection from hostName = " + addr.getHostName()
          + ", hostAddress = " + addr.getHostAddress()
          + ", canonicalHostName = " + addr.getCanonicalHostName());
      PrintWriter pw = new PrintWriter(s.getOutputStream());
      pw.println("hello");
      pw.close();
      s.close();
    }
  }
}
{code}
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493904#comment-13493904 ]

Hudson commented on HDFS-3990:

Integrated in Hadoop-Yarn-trunk #31 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/31/])
HDFS-3990. NN's health report has severe performance problems (daryn) (Revision 1407333)

Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493973#comment-13493973 ]

Hudson commented on HDFS-3990:

Integrated in Hadoop-Hdfs-0.23-Build #430 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/430/])
HDFS-3990. NN's health report has severe performance problems (daryn) (Revision 1407336)

Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407336
Files :
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493985#comment-13493985 ]

Hudson commented on HDFS-3990:

Integrated in Hadoop-Hdfs-trunk #1221 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1221/])
HDFS-3990. NN's health report has severe performance problems (daryn) (Revision 1407333)

Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494011#comment-13494011 ]

Hudson commented on HDFS-3990:

Integrated in Hadoop-Mapreduce-trunk #1251 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1251/])
HDFS-3990. NN's health report has severe performance problems (daryn) (Revision 1407333)

Result = FAILURE
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493651#comment-13493651 ]

Hudson commented on HDFS-3990:

Integrated in Hadoop-trunk-Commit #2985 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2985/])
HDFS-3990. NN's health report has severe performance problems (daryn) (Revision 1407333)

Result = SUCCESS
daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407333
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeID.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487135#comment-13487135 ]

Hadoop QA commented on HDFS-3990:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12551382/HDFS-3990.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3428//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3428//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487247#comment-13487247 ]

Eli Collins commented on HDFS-3990:

I missed that you switched to a List because we're conditionally adding items, which makes it hard to use an ImmutableList. I think using a List is better than the latest patch, where you convert the List to an array, so +1 to the Oct 22nd patch.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486279#comment-13486279 ]

Eli Collins commented on HDFS-3990:

Now that you're switching getNodeNamesForHostFiltering from using an array to a List, I'd use an ImmutableList. +1 otherwise.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481804#comment-13481804 ]

Hadoop QA commented on HDFS-3990:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12550340/HDFS-3990.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3378//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3378//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479000#comment-13479000 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

bq. Let's remove this line and leave peerHostName as null since we're claiming the peerHostname is the hostname from the actual connection. It's also useful to have something to check to indicate the peerHostName has not been determined.

The known case where the {{peerHostName}} will not be set is when the minicluster tests directly register a dn. If the assignment is removed, I'm not sure where the null check should go or what it should do. It could be in {{DatanodeID#getPeerHostName}} and return the {{hostName}} field. Or the getter could return null, and {{DatanodeManager#getNodeNamesForHostFiltering}} would leave out the {{peerHostName}} when it is null. I'm a bit concerned that tests, such as the include/exclude list checks, might break again...

Or I could update the comment to indicate it's either the remote RPC host or the dn registration's hostname?
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479207#comment-13479207 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

A null peerHostName just means you don't match. Since we also check the hostName reported by the DataNode, which the minicluster explicitly sets, we should be good; that's the current behavior after all, right? I.e., the only thing we're adding here is an additional hostname field to check, which is null in the tests and so won't be checked.

Related: it would be good to make the minicluster match real cluster behavior here.
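The matching rule discussed above (an undetermined peerHostName is simply skipped, and the other names are still checked) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the patch: the class `HostFilterSketch` and its methods `namesForHostFiltering` and `matchesAny` are hypothetical names, and the real logic lives in {{DatanodeManager}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the host-filtering check; not the actual DatanodeManager code. */
final class HostFilterSketch {

    /** Collects the names a node may be matched by, skipping any that were never determined. */
    static List<String> namesForHostFiltering(String ipAddr, String hostName, String peerHostName) {
        List<String> names = new ArrayList<>();
        names.add(ipAddr);
        if (hostName != null) {
            names.add(hostName);      // name reported by the DataNode itself
        }
        if (peerHostName != null) {
            names.add(peerHostName);  // name taken from the connection; null when unknown
        }
        return names;
    }

    /** Returns true if any candidate name appears in the include or exclude list. */
    static boolean matchesAny(Set<String> hostList, List<String> names) {
        for (String name : names) {
            if (hostList.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> excludes = new HashSet<>(Arrays.asList("dn1.example.com"));
        // A directly registered minicluster node: peerHostName was never set.
        List<String> names = namesForHostFiltering("127.0.0.1", "dn1.example.com", null);
        System.out.println(matchesAny(excludes, names)); // the reported hostName still matches
    }
}
```

With a null peerHostName the node is still matched through the DataNode-reported hostName, which is the pre-existing behavior the comment refers to.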
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477909#comment-13477909 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

In your patch, it's not necessary for the NN to do another lookup of the DN's hostname. It's already available in the {{InetAddress}} returned by {{Server.getRemoteIp()}}. Passing this {{InetAddress}} to {{updateNodeAddr}}, rather than updating the hostname and ip individually, ensures the host and ip are always updated in tandem, which avoids your concern about the fields going out of sync.

If we do change the datanode manager to ignore the hostname in the node registration, do you think it's possible to update all the tests that check rack placement? I'm not sure how we can do that in a timely manner, so would you be willing to have a separate jira for that functional change to expedite this compatible one?
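As a rough illustration of updating the host and ip in tandem from a single {{InetAddress}}, consider the sketch below. The class `NodeAddrSketch`, its fields, and the `fromLiteral` helper are hypothetical and only mirror the idea in the comment; the actual {{DatanodeID}} code may differ. The address here is built with {{InetAddress.getByAddress(String, byte[])}}, which performs no name-service lookup, so the sketch runs without DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical node record; illustrates setting host and ip from one InetAddress. */
final class NodeAddrSketch {
    private String ipAddr;
    private String peerHostName;

    /** Sets both fields from the same address object so they cannot drift apart. */
    void updateNodeAddr(InetAddress addr) {
        this.ipAddr = addr.getHostAddress();
        // Returns the name the address was created with; no lookup happens in that case.
        this.peerHostName = addr.getHostName();
    }

    String getIpAddr() {
        return ipAddr;
    }

    String getPeerHostName() {
        return peerHostName;
    }

    /** Builds an address from a known name and raw IP bytes without consulting DNS. */
    static InetAddress fromLiteral(String host, byte[] ip) {
        try {
            return InetAddress.getByAddress(host, ip);
        } catch (UnknownHostException e) {
            // Only thrown when the byte array has an illegal length.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        NodeAddrSketch node = new NodeAddrSketch();
        node.updateNodeAddr(fromLiteral("dn1.example.com", new byte[] {10, 0, 0, 1}));
        System.out.println(node.getPeerHostName() + " / " + node.getIpAddr());
    }
}
```

Taking a single argument means a caller cannot update one field and forget the other, which is the consistency property the comment is after.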
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477975#comment-13477975 ]

Hadoop QA commented on HDFS-3990:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549515/HDFS-3990.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
    org.apache.hadoop.hdfs.server.balancer.TestBalancer
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3355//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3355//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478000#comment-13478000 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

The balancer test seems to randomly fail. It passes for me.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478271#comment-13478271 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

I think the approach in the latest patch should work. Once HDFS-4068 is in you can rebase on it and remove all the cleanup.

Comments:
- We can remove the dnAddress check for null now that it looks like NNThroughputBenchmark always uses RPC.
- Rename getNodeNames to something more explicit, like getNodeNamesForHostFiltering?
- Rather than have updateNodeAddr, let's use the two setters explicitly; that makes the registration behavior easier to follow (i.e. we explicitly clobber the ip and peer hostname). Hopefully we'll eventually be able to make DatanodeID immutable so we don't update it in place.
- Let's update getNodeNames to include the DN hostname since that is the current behavior, and file a separate jira for removing the use of the DN-reported hostname here (or perhaps removing the reported DN field entirely).
- Let's update hashCode in a separate change. I think this will need some additional changes, like modifying Host2NodesMap to use DatanodeID#hashCode; it currently explicitly uses the IP addr for the hash and ignores DatanodeID#hashCode.
- Add a javadoc to testDNSLookups indicating that we're testing that the NN does *not* do DN lookups after registration.
- Nit: I'd create the SM inline via System.setSecurityManager(new SecurityManager() { so it's clear it's only associated with this DNS test (as TestDFSShell does, for example).
- Nit: rename lookups in the test to initialLookups.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478350#comment-13478350 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

I'm making the changes, but I found that I cannot remove the null check for dnAddress in the nodemanager. Tests using the minicluster directly get the rpc server (the remote/internal one of the NN), so no rpc socket connection is formed.

I also don't think I can inline the SecurityManager (I initially tried), because then I cannot get access to the count of lookups. Java won't recognize that the field is available, or let me call a getter method, because it's not defined by the SM.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478492#comment-13478492 ]

Hadoop QA commented on HDFS-3990:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549589/HDFS-3990.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3358//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3358//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478557#comment-13478557 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

Two small comments, +1 otherwise!

- Let's remove this line and leave peerHostName as null since we're claiming the peerHostname is the hostname from the actual connection. It's also useful to have something to check to indicate the peerHostName has not been determined.
{code}this.peerHostName = hostName; // will assume it's the given host for now{code}
- Move the "// Update the IP to the address of the RPC request..." comment up with the setIpAddr call.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478650#comment-13478650 ]

Hadoop QA commented on HDFS-3990:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549614/HDFS-3990.branch-0.23.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3360//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477053#comment-13477053 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

bq. Maintaining both an ipAddr/hostName plus nodeAddr with the same information, which can become inconsistent is error prone. For example what do you do when the ipAddr and the nodeAddr disagree?

They should never disagree because the nodeAddr is based on the ipAddr, and when the nodeAddr is changed, so is the ipAddr.

bq. The ipAddr field for a DataNode ID should never change because it (and the xferPort) are the unique key for a DataNode.

They will change when a pre-existing node, say one with the same storage id, is updated with the new info.

bq. We also now have to worry about the state where we're both resolved and unresolved.

We need to worry about that case just like the code did before. Let's say the exclude list has hostnames. A node registration occurs but there's a dns hiccup, so all we have is its ip. Your proposed patch may let the node in, whereas the existing code (and my patch) will block the node.

bq. What do you think of the attached patch? It sets the DatanodeID hostname field at registration time (like the IP addr) ...

The patch appears to change the way the include and exclude work by trusting who the datanode claims to be. What if a datanode lies about who it is? Or if a dns hiccup occurs when the datanode is going to register? It sends its name as an ip, but the exclude list only has hosts. There are a number of scenarios where a datanode could bypass the include/exclude list, which is why we should never trust the client.

bq. ... using the same lookup we do today and replaces the two problematic lookups with uses of this field.

Unless I've overlooked something, there's only one lookup that occurs?

I'll post a minor rev for consideration that should further ensure the fields never go out of sync.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477131#comment-13477131 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

bq. They will change when a pre-existing node, say one with the same storage id, is updated with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually works today.

bq. The patch appears to change the way the include and exclude work by trusting who the datanode claims to be. What if a datanode lies about who it is? Or if a dns hiccup occurs when the datanode is going to register? It sends its name as an ip, but the exclude list only has hosts. There are a number of scenarios where a datanode could bypass the include/exclude list, which is why we should never trust the client.

Take another look at the patch: the NN is doing the lookup, not the DN, just at registration time. How about we reject the DN registration in case of a DNS hiccup (rather than use the DN value, which the patch currently does in this case)? The DN will retry until it succeeds.

When working on HDFS-3171 I considered removing the ability for the DN to override the hostname, and having just one lookup per DN (i.e. currently both the NN and DN resolve the DN hostname). We could open a separate jira for that; it might be easier to layer this one atop it.

I'm against having DatanodeID fields that duplicate the other fields, since I think we can solve the problem here and avoid doing so. My experience from HDFS-3144 indicates we will introduce bugs and it's hard to correctly untangle later.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477181#comment-13477181 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

bq. I'm not sure re-registering with a new IP and the same storage ID actually works today.

Jason Lowe recently finished a jira to make that work.

bq. How about we reject the DN registration in case of a DNS hiccup (rather than use the DN value which the patch currently does in this case)?

I think I'm fine with that, so long as we accept that we are more strictly ruling out the ability to run a cluster in a dns-less or dns error-tolerant environment. I was considering a second jira that would first scan the include/exclude for the ip and, if not found, would return include=false or exclude=true if the ip is unresolved, instead of flat out rejecting the node.

Ignoring the name the dn declares is a trivial enough change; do you think we can just do it in this patch? I was trying to avoid any functional change with this patch (because who knows what will break!), but I'll post a revised patch that rejects unresolved nodes and ignores the dn's declared name, if that's ok with you?
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477296#comment-13477296 ]

Hadoop QA commented on HDFS-3990:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549353/HDFS-3990.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
    org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
    org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement
    org.apache.hadoop.hdfs.TestMiniDFSCluster
    org.apache.hadoop.cli.TestHDFSCLI
    org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
    org.apache.hadoop.hdfs.TestReplication
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3347//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3347//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3347//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477373#comment-13477373 ]

Daryn Sharp commented on HDFS-3990:
-----------------------------------

Ignoring the hostname the datanode claims to be is blowing up tests that are checking rack placement. Those tests need to use spoofed hostnames for the rack mapping. Prior to the patch, only the include/exclude lists checked the real hostname. Using the datanode's claimed hostname for the include/exclude checks creates a security issue, and ignoring the claimed hostname causes tests to fail.

I was fearful that any functional change would break the code, so I'll toss up another variant of the original patch that keeps the two names separate. We really need this dns fix, so I think we'll need to break the unified and proper handling of the dn hostnames out to another jira. Agree?
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477409#comment-13477409 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

Yea, that's what I meant above by "This breaks dfs.datanode.hostname but this config is only used by the tests and we can fix those up." How about I fix up my previous patch to unconditionally set the hostname and fix the tests?

Btw, the latest patch has some changes, like changing DatanodeID#hashCode to ignore the IP addr, that I don't think are correct.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477415#comment-13477415 ] Daryn Sharp commented on HDFS-3990: ---
bq. Btw the latest patch has some changes like changing DatanodeID#hashCode to ignore the IP addr, I don't think that's correct.
The ip is mutable, so it can't be part of the {{hashCode}}. When a datanode registers with an existing storage id, the ip and port will be updated, which would change the hash of a node that is already stored in a collection. The storage id is immutable and unique, so basing the {{hashCode}} off of it should be sufficient?
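The {{hashCode}} argument in the comment above can be illustrated with a small, self-contained sketch. The classes NodeId and IpHashedNodeId are hypothetical stand-ins, not the real DatanodeID, and their fields are assumptions made only for illustration. The sketch contrasts hashing an immutable storage id with hashing a mutable ip.

```java
import java.util.Objects;

// Hypothetical stand-in for a datanode identifier; NOT the real DatanodeID.
// Identity is based only on the immutable storage id.
final class NodeId {
    private final String storageId; // assumed immutable and unique
    private String ipAddr;          // may change when the node re-registers

    NodeId(String storageId, String ipAddr) {
        this.storageId = storageId;
        this.ipAddr = ipAddr;
    }

    void setIpAddr(String ipAddr) { this.ipAddr = ipAddr; }
    String getIpAddr() { return ipAddr; }

    @Override
    public boolean equals(Object o) {
        return o instanceof NodeId && storageId.equals(((NodeId) o).storageId);
    }

    @Override
    public int hashCode() {
        // Stable for the lifetime of the object because storageId never changes.
        return storageId.hashCode();
    }
}

// Counter-example (also hypothetical): the mutable ip takes part in the hash.
final class IpHashedNodeId {
    private final String storageId;
    private String ipAddr;

    IpHashedNodeId(String storageId, String ipAddr) {
        this.storageId = storageId;
        this.ipAddr = ipAddr;
    }

    void setIpAddr(String ipAddr) { this.ipAddr = ipAddr; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IpHashedNodeId)) {
            return false;
        }
        IpHashedNodeId other = (IpHashedNodeId) o;
        return storageId.equals(other.storageId) && Objects.equals(ipAddr, other.ipAddr);
    }

    @Override
    public int hashCode() {
        // Changes whenever ipAddr changes, so an instance already stored in a
        // HashSet or HashMap may no longer be found after setIpAddr().
        return Objects.hash(storageId, ipAddr);
    }
}
```

With these definitions, a NodeId placed in a HashSet is still found after setIpAddr() is called, while an IpHashedNodeId is looked up under a different hash after the update and is typically reported as missing.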
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477420#comment-13477420 ] Daryn Sharp commented on HDFS-3990: --- I forgot to mention that I think you'll find fixing/updating the rack placement tests will be exceedingly difficult w/o doing something very hacky. Everything looks like localhost to a minicluster. At least Kihwal and I couldn't find a clean way to update the tests...
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477620#comment-13477620 ] Eli Collins commented on HDFS-3990: --- Pulled cleanup out to HDFS-4068.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476139#comment-13476139 ] Daryn Sharp commented on HDFS-3990: --- The caching is to prevent the unnecessary dns lookups that are a multiple of the number of datanodes. These typically happen just to view a jsp or query json, but also for other internal operations. Every time a node is checked against the include/exclude lists, it generates dns queries numbering 2X the datanodes. Counting the number of nodes causes a dns query for every datanode. Reassigning an ip should require no restart of the NN. The DNs are tracked by their ip and storage id. If a DN registers with a previously known ip or storage id, the existing node is updated with the fields in the new node id, which contain a refreshed lookup.
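The registration behaviour described in the comment above (nodes tracked by ip and storage id, with an existing node refreshed on re-registration) might look roughly like the following sketch. NodeRecord and NodeRegistry are hypothetical names and do not correspond to real HDFS classes. The handling of a conflicting record under the same ip is a choice made only to keep the sketch's two indexes consistent, not something stated in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record of a registered datanode; not a real HDFS class.
final class NodeRecord {
    String storageId;
    String ip;
    String hostName; // assumed to come from the lookup done at registration
}

// Hypothetical registry that indexes nodes by storage id and by ip.
final class NodeRegistry {
    private final Map<String, NodeRecord> byStorageId = new HashMap<>();
    private final Map<String, NodeRecord> byIp = new HashMap<>();

    // Registers a node. If the storage id or the ip is already known, the
    // existing record is refreshed in place instead of creating a new one.
    NodeRecord register(String storageId, String ip, String hostName) {
        NodeRecord node = byStorageId.get(storageId);
        if (node == null) {
            node = byIp.get(ip);
        }
        if (node == null) {
            node = new NodeRecord();
        } else {
            // Remove the old index entries before the fields change.
            byStorageId.remove(node.storageId);
            byIp.remove(node.ip);
        }
        // Sketch-only choice: drop any other record still indexed under this ip
        // so that both maps stay consistent with each other.
        NodeRecord other = byIp.remove(ip);
        if (other != null) {
            byStorageId.remove(other.storageId);
        }
        node.storageId = storageId;
        node.ip = ip;
        node.hostName = hostName; // carries the refreshed lookup result
        byStorageId.put(storageId, node);
        byIp.put(ip, node);
        return node;
    }

    NodeRecord findByIp(String ip) { return byIp.get(ip); }

    int size() { return byStorageId.size(); }
}
```

In this sketch, re-registering storage id "DS-1" from a new ip updates the one existing record, which is why no restart would be needed to pick up the reassigned address.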
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476320#comment-13476320 ] Daryn Sharp commented on HDFS-3990: --- Pre-commit build is clean, but it failed to connect to jira: https://builds.apache.org/job/PreCommit-HDFS-Build/3331/consoleText
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476429#comment-13476429 ] Eli Collins commented on HDFS-3990: --- Why not use the DatanodeID hostName field instead of calling and caching InetAddress#getByName in the NN? The DN has already done the lookup (modulo the tests which use dfs.datanode.hostname) and this way we don't have to worry about inconsistency between the nodeAddr and the ipAddr/hostName fields. For sanity the NN could do a lookup when the DN registers and compare its value to the DN reported one.
Comments on this patch:
- In registerDatanode why is it OK to no longer update the registration info with the reported IP?
- The comments in DatanodeManager ("Mostly called inside an RPC..." and "Update the IP to the address of the RPC request..") are no longer accurate after your change.
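The sanity check suggested in the comment above (the NN does its own lookup at registration and compares it with the name the DN reported) could be sketched as follows. HostNameSanityCheck and its method are hypothetical names. The lookup is passed in as a function so that the comparison can be exercised without real DNS; using InetAddress#getCanonicalHostName as the default lookup is an assumption of this sketch, not something the thread specifies.

```java
import java.net.InetAddress;
import java.util.function.Function;

// Hypothetical helper; not part of HDFS.
final class HostNameSanityCheck {
    private HostNameSanityCheck() {}

    // Returns true when the name obtained by the NN-side lookup of the peer
    // address equals the hostname the DN reported (ignoring case).
    static boolean reportedNameMatches(String reportedHostName,
                                       InetAddress peer,
                                       Function<InetAddress, String> lookup) {
        String resolved = lookup.apply(peer);
        return resolved != null && resolved.equalsIgnoreCase(reportedHostName);
    }

    // Convenience overload. Assumption: a reverse lookup through the JDK is
    // an acceptable stand-in for "the NN could do a lookup".
    static boolean reportedNameMatches(String reportedHostName, InetAddress peer) {
        return reportedNameMatches(reportedHostName, peer,
                InetAddress::getCanonicalHostName);
    }
}
```

A mismatch would only be a signal for the caller to act on, for example by logging it. What the NN should do in that case is left open in the thread.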
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476446#comment-13476446 ] Daryn Sharp commented on HDFS-3990: --- As best I can tell, the {{DatanodeID}}'s hostname is what the DN claims to be in the registration. The existing include/exclude list checks use the DN's ip and real hostname, not the one the node claimed to be in the registration. I'm trying to preserve existing behavior by just caching the socket's peer name at registration, so that the resolved socket addr can be reused when checking the include/exclude lists.
bq. In registerDatanode why is OK to no longer update the registration info with the reported IP?
The ip actually is updated when {{setNodeAddr}} is called with the socket's peer. My bad on the comments. I'm not sure how I lost that change. I know the approach isn't perfect, and many of the fields could likely be folded together into the socket addr, but I'm trying to make a minimal change to avoid a slew of dns queries that are having an adverse performance impact on multi-thousand node clusters.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476461#comment-13476461 ] Daryn Sharp commented on HDFS-3990: --- I'm also handling the case where a transient dns error may have occurred at the time a socket connected. The patch will attempt another lookup when the nodeAddr is requested.
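The two ideas from the comments above, caching the resolved peer address once and retrying the lookup only if it was unresolved when cached, can be sketched as below. CachedNodeAddr is a hypothetical class and is not taken from the patch. The resolver is injectable so the retry path can be exercised without real DNS.

```java
import java.net.InetSocketAddress;
import java.util.function.BiFunction;

// Hypothetical holder for a node's socket address; not the patch's code.
final class CachedNodeAddr {
    private InetSocketAddress addr;
    private final BiFunction<String, Integer, InetSocketAddress> resolver;

    CachedNodeAddr(InetSocketAddress initial,
                   BiFunction<String, Integer, InetSocketAddress> resolver) {
        this.addr = initial;
        this.resolver = resolver;
    }

    // Default resolver: the InetSocketAddress(String, int) constructor, which
    // attempts resolution and marks the address unresolved if that fails.
    CachedNodeAddr(InetSocketAddress initial) {
        this(initial, (host, port) -> new InetSocketAddress(host, port));
    }

    // Returns the cached address. A new lookup is attempted only when the
    // cached address is unresolved, e.g. after a transient dns error.
    synchronized InetSocketAddress get() {
        if (addr.isUnresolved()) {
            InetSocketAddress retry =
                    resolver.apply(addr.getHostString(), addr.getPort());
            if (!retry.isUnresolved()) {
                addr = retry; // keep the successful result for later calls
            }
        }
        return addr;
    }
}
```

Once a lookup has succeeded, later calls to get() return the stored address without touching the resolver, which is the property that avoids a dns query per datanode on every include/exclude check.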
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476502#comment-13476502 ] Ravi Prakash commented on HDFS-3990: Thanks for your explanations, Daryn! The src/main code looks reasonable to me.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476641#comment-13476641 ] Hadoop QA commented on HDFS-3990: -
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12549228/hdfs-3990.txt against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
  org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement
  org.apache.hadoop.cli.TestHDFSCLI
  org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
  org.apache.hadoop.hdfs.TestMiniDFSCluster
  org.apache.hadoop.hdfs.TestReplication
  org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3338//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3338//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13475370#comment-13475370 ] Hadoop QA commented on HDFS-3990: -
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12548945/HDFS-3990.patch against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
  org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3323//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3323//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3323//console

This message is automatically generated.
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13475497#comment-13475497 ] Ravi Prakash commented on HDFS-3990: I'm sorry I've been out of the loop, but why would caching be the solution? If we want to reassign the IP address mapped to a hostname for a single node, would it require a restart of the NN? Is there a timeout with the caching? Even with a timeout I would have my reservations. Do nodes have hadoop-generated unique IDs that we can leverage and match with the IP addresses that we have cached?
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465597#comment-13465597 ] Daryn Sharp commented on HDFS-3990: --- Enabling an nscd host cache helped mitigate the issue by reducing load times to a few seconds. However, holding the namespace read lock during the lookups is highly undesirable, and the repeated dns lookups are questionable.
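The concern about doing slow lookups while holding the namespace read lock can be illustrated with a generic pattern: copy what is needed under the lock, release it, and do the slow work on the copy. The sketch below is not the fix applied in this issue and not NameNode code. HealthReport, its fields, and the lookup function are hypothetical and exist only to show the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical report generator; stands in for a page that lists all nodes.
final class HealthReport {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> nodeIps = new ArrayList<>();

    void addNode(String ip) {
        lock.writeLock().lock();
        try {
            nodeIps.add(ip);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Takes a snapshot under the read lock, then runs the (possibly slow)
    // lookup for each node after the lock has been released.
    List<String> render(Function<String, String> slowLookup) {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(nodeIps);
        } finally {
            lock.readLock().unlock();
        }
        List<String> rendered = new ArrayList<>(snapshot.size());
        for (String ip : snapshot) {
            rendered.add(slowLookup.apply(ip)); // no lock held here
        }
        return rendered;
    }

    // True if the calling thread currently holds the read lock.
    boolean readLockHeldByCurrentThread() {
        return lock.getReadHoldCount() > 0;
    }
}
```

The trade-off is that the report reflects the node list as of the snapshot, so a node added during rendering does not appear until the next call.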
[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems
[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465598#comment-13465598 ] Daryn Sharp commented on HDFS-3990: --- Arun, please update the target version if you want to defer the fix to a later 2.x release.