[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169078#comment-14169078 ] Zhanwei Wang commented on HDFS-7207: bq. Since the current libhdfs3 C++ interface you have proposed exposes so many internal things, this will make it almost impossible to change anything after a release. All internal things are included in the {{hdfs::internal}} namespace and are not exposed in the header files as user-facing API. Would you please point out which internal things are exposed? I'd like to fix them. If an API class has to reference an internal object, I introduce a forward declaration and add a pointer to the internal object in the private section of the API class. One exception is the {{hdfs::internal::shared_ptr}} used in {{FileSystem}}, which was NOT in my original patch; I wrapped it to hold a pointer to {{FileSystemImpl}} and do not use {{shared_ptr}} in the interface, to avoid exposing {{hdfs::internal::shared_ptr}}. Exposing {{shared_ptr}} and {{std::string}} in the API may introduce binary compatibility issues, since different C++ compilers and runtimes may have different, incompatible ABIs. To avoid this, the best way is to use only C types in the API. libhdfs3 should not expose exceptions in public C++ API --- Key: HDFS-7207 URL: https://issues.apache.org/jira/browse/HDFS-7207 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Colin Patrick McCabe Priority: Blocker Attachments: HDFS-7207.001.patch There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
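For illustration, here is a minimal sketch of the forward-declaration layout described in the comment above; the class and method names are assumptions for the sake of the example, not the actual libhdfs3 headers:
{code}
// FileSystem.h -- public header; nothing from hdfs::internal is included here.
namespace hdfs {
namespace internal {
class FileSystemImpl;                  // forward declaration only; the definition stays internal
}

class FileSystem {
public:
    FileSystem();
    ~FileSystem();

    // Only C types cross the API boundary, so the ABI does not depend on the
    // caller's STL implementation (no std::string or shared_ptr in signatures).
    int open(const char *path, int flags);

private:
    internal::FileSystemImpl *impl;    // plain pointer to the implementation object
};
}
{code}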
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169085#comment-14169085 ] Zhanwei Wang commented on HDFS-7207: bq. A slightly simpler (probably subjective) approach might be to wrap things in the opposite way. That is, putting the error message / stack traces in the Status object directly and let hdfsGetLastError to get the string. I think this way is more straightforward. Another issue is how to deal with the constructor: since it has no return value, what if {{std::bad_alloc}} is thrown in the constructor? Currently the ways I can figure out are to 1) use the factory pattern, or 2) add a flag member in the interface class to indicate the error. Any suggestions? libhdfs3 should not expose exceptions in public C++ API --- Key: HDFS-7207 URL: https://issues.apache.org/jira/browse/HDFS-7207 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Colin Patrick McCabe Priority: Blocker Attachments: HDFS-7207.001.patch There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
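A rough sketch of the factory-pattern option mentioned above, written so that no exception (not even {{std::bad_alloc}}) escapes construction; the names {{New()}} and {{init()}} are hypothetical, not part of the proposed API:
{code}
#include <cstddef>   // NULL
#include <new>       // std::nothrow

class FileSystem {
public:
    // Option 1 above: a static factory absorbs construction failures and
    // reports them through the return value instead of an exception.
    static FileSystem *New() {
        FileSystem *fs = new (std::nothrow) FileSystem();   // returns NULL instead of throwing
        if (fs != NULL && !fs->init()) {                    // fallible initialization step
            delete fs;
            fs = NULL;
        }
        return fs;       // NULL tells the caller that construction failed
    }

private:
    FileSystem() {}                      // trivial, non-throwing constructor
    bool init() { return true; }         // placeholder; a real init would do the risky work
};
{code}
Option 2 would instead construct the object unconditionally and set an error flag member that the caller checks before using the instance.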
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169097#comment-14169097 ] Zhanwei Wang commented on HDFS-7207: In my previous patch, InputStream/OutputStream keep a shared_ptr of FileSystemImpl for an important reason, which is to avoid using an invalid pointer to FileSystemImpl if the filesystem is destroyed before the InputStream/OutputStream. {code}
DB *db = DB::Open();
Iterator *it = db->NewIterator(...);
delete db; // bails out because the iterator it has leaked.
{code}
This way is good in my opinion: it makes the user more aware of the leaks, but a shared_ptr of DBImpl still needs to be kept in the Iterator to avoid a core dump if the user continues to use the iterator after {{delete db}}. libhdfs3 should not expose exceptions in public C++ API --- Key: HDFS-7207 URL: https://issues.apache.org/jira/browse/HDFS-7207 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Colin Patrick McCabe Priority: Blocker Attachments: HDFS-7207.001.patch There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
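A sketch of the ownership arrangement described above, with the stream privately sharing ownership of the implementation object so it never dereferences a dangling pointer; the names and signatures here are illustrative, not the actual libhdfs3 code:
{code}
#include <memory>

namespace hdfs {
namespace internal { class FileSystemImpl; }

class InputStream {
public:
    int read(char *buf, int size);        // illustrative signature

private:
    friend class FileSystem;              // only FileSystem creates streams
    explicit InputStream(std::shared_ptr<internal::FileSystemImpl> fs)
        : fs_(fs) {}

    // Shared ownership keeps FileSystemImpl alive for the stream's lifetime,
    // so reading after the FileSystem wrapper has been deleted does not touch
    // freed memory. The shared_ptr stays in the private section and never
    // appears in a public method signature.
    std::shared_ptr<internal::FileSystemImpl> fs_;
};
}
{code}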
[jira] [Assigned] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak M reassigned HDFS-6544: --- Assignee: Suraj Nayak M Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak M updated HDFS-6544: Status: Patch Available (was: Open) Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169445#comment-14169445 ] Yongjun Zhang commented on HDFS-7146: - Hi [~aw] and [~brandonli], Thanks for the earlier review and discussion here. I created HADOOP-11195 per Allen's suggestion to merge the two existing mechanisms that cache user/group info into the hadoop-common area. Certainly I agree with the general software engineering principle of code sharing and its benefits. My original thought was that we could do things in a different order, but let's try this route of fixing HADOOP-11195 first and then HDFS-7146. I might incorporate HDFS-7146 into the same fix as HADOOP-11195, though. NFS ID/Group lookup requires SSSD enumeration on the server --- Key: HDFS-7146 URL: https://issues.apache.org/jira/browse/HDFS-7146 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, HDFS-7146.003.patch The current implementation of the NFS UID and GID lookup works by running 'getent passwd' with an assumption that it will return the entire list of users available on the OS, local and remote (AD/etc.). This behaviour of the command is advised to be and is prevented by administrators in most secure setups to avoid excessive load to the ADs involved, as the # of users to be listed may be too large, and the repeated requests of ALL users not present in the cache would be too much for the AD infrastructure to bear. The NFS server should likely do lookups based on a specific UID request, via 'getent passwd UID', if the UID does not match a cached value. This reduces load on the LDAP backed infrastructure. Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169519#comment-14169519 ] Benoy Antony commented on HDFS-7204: [~aw], can we avoid setting {{daemon=true}} for each component like balancer? I may be missing the intention behind the internal boolean variable, {{daemon}}. Shouldn't it be set based on whether {{--daemon}} is part of the invocation? balancer doesn't run as a daemon Key: HDFS-7204 URL: https://issues.apache.org/jira/browse/HDFS-7204 Project: Hadoop HDFS Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Attachments: HDFS-7204-01.patch, HDFS-7204.patch From HDFS-7184, minor issues with balancer: * daemon isn't set to true in hdfs to enable daemonization * start-balancer script has usage instead of hadoop_usage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7232) Populate hostname in httpfs audit log
[ https://issues.apache.org/jira/browse/HDFS-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoran Dimitrijevic updated HDFS-7232: - Attachment: HDFS-7232.patch Populate hostname in httpfs audit log - Key: HDFS-7232 URL: https://issues.apache.org/jira/browse/HDFS-7232 Project: Hadoop HDFS Issue Type: Bug Reporter: Zoran Dimitrijevic Assignee: Zoran Dimitrijevic Priority: Trivial Attachments: HDFS-7232.patch Currently httpfs audit logs do not log the request's IP address. Since they use hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/conf/httpfs-log4j.properties which already contains hostname, it would be nice to add code to populate it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7232) Populate hostname in httpfs audit log
[ https://issues.apache.org/jira/browse/HDFS-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoran Dimitrijevic updated HDFS-7232: - Affects Version/s: 3.0.0 Status: Patch Available (was: Open) This is a simple patch for audit logs. I'm not sure if it'll require unit-tests, but for now I don't have them. Populate hostname in httpfs audit log - Key: HDFS-7232 URL: https://issues.apache.org/jira/browse/HDFS-7232 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Zoran Dimitrijevic Assignee: Zoran Dimitrijevic Priority: Trivial Attachments: HDFS-7232.patch Currently httpfs audit logs do not log the request's IP address. Since they use hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/conf/httpfs-log4j.properties which already contains hostname, it would be nice to add code to populate it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169561#comment-14169561 ] Brandon Li commented on HDFS-7146: -- [~yzhangal], thanks for filing HADOOP-11195 to track the effort. It's good to not mix bug fixes and code improvement in the same JIRA. NFS ID/Group lookup requires SSSD enumeration on the server --- Key: HDFS-7146 URL: https://issues.apache.org/jira/browse/HDFS-7146 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, HDFS-7146.003.patch The current implementation of the NFS UID and GID lookup works by running 'getent passwd' with an assumption that it will return the entire list of users available on the OS, local and remote (AD/etc.). This behaviour of the command is advised to be and is prevented by administrators in most secure setups to avoid excessive load to the ADs involved, as the # of users to be listed may be too large, and the repeated requests of ALL users not present in the cache would be too much for the AD infrastructure to bear. The NFS server should likely do lookups based on a specific UID request, via 'getent passwd UID', if the UID does not match a cached value. This reduces load on the LDAP backed infrastructure. Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169570#comment-14169570 ] Hadoop QA commented on HDFS-6544: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650618/HDFS-6544.patch against trunk revision e8a31f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8403//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8403//artifact/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8403//console This message is automatically generated. Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7232) Populate hostname in httpfs audit log
[ https://issues.apache.org/jira/browse/HDFS-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169571#comment-14169571 ] Hadoop QA commented on HDFS-7232: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674532/HDFS-7232.patch against trunk revision 793dbf2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8404//console This message is automatically generated. Populate hostname in httpfs audit log - Key: HDFS-7232 URL: https://issues.apache.org/jira/browse/HDFS-7232 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Zoran Dimitrijevic Assignee: Zoran Dimitrijevic Priority: Trivial Attachments: HDFS-7232.patch Currently httpfs audit logs do not log the request's IP address. Since they use hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/conf/httpfs-log4j.properties which already contains hostname, it would be nice to add code to populate it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7236) TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in trunk
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169574#comment-14169574 ] Jing Zhao commented on HDFS-7236: - +1. I will commit the patch shortly. TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in trunk Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {code} AND {code} 2014-10-11 12:38:28,552 WARN datanode.DataNode (BPServiceActor.java:offerService(751)) - RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.io.IOException): Got incremental
[jira] [Updated] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7236: Summary: Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots (was: TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in trunk) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {code} AND {code} 2014-10-11 12:38:28,552 WARN datanode.DataNode (BPServiceActor.java:offerService(751)) - RemoteException in offerService
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169585#comment-14169585 ] Yongjun Zhang commented on HDFS-7146: - Hi [~brandonli], thanks for the feedback, your point is well taken. NFS ID/Group lookup requires SSSD enumeration on the server --- Key: HDFS-7146 URL: https://issues.apache.org/jira/browse/HDFS-7146 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, HDFS-7146.003.patch The current implementation of the NFS UID and GID lookup works by running 'getent passwd' with an assumption that it will return the entire list of users available on the OS, local and remote (AD/etc.). This behaviour of the command is advised to be and is prevented by administrators in most secure setups to avoid excessive load to the ADs involved, as the # of users to be listed may be too large, and the repeated requests of ALL users not present in the cache would be too much for the AD infrastructure to bear. The NFS server should likely do lookups based on a specific UID request, via 'getent passwd UID', if the UID does not match a cached value. This reduces load on the LDAP backed infrastructure. Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169588#comment-14169588 ] Haohui Mai commented on HDFS-7207: -- bq. it has no return value, what if {{std::bad_alloc}} is thrown in the constructor? The interface can avoid throwing any exceptions at all, even {{std::bad_alloc}}. A static factory call is sufficient to take care of it. For example: {code} static Status Create(FileSystem **fsptr); {code} Note that the {{Status}} object also allows users to get the error information in the form of strings. bq. but a shared_ptr of DBImpl still needs to be kept in the Iterator to avoid a core dump if the user continues to use the iterator after delete db The expected behavior is to crash right at the line of {{delete db}}. It avoids any use of dangling iterators. Obviously the code needs to keep a refcount somewhere, but that way the code does not need to expose {{std::shared_ptr}} in the interface. libhdfs3 should not expose exceptions in public C++ API --- Key: HDFS-7207 URL: https://issues.apache.org/jira/browse/HDFS-7207 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Colin Patrick McCabe Priority: Blocker Attachments: HDFS-7207.001.patch There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
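To make the no-throw, Status-based creation path concrete, here is one way it could look; the Status layout shown (an integer code plus a message accessor) is an assumption for illustration, not an agreed-upon interface:
{code}
class Status {
public:
    static Status OK() { return Status(0, ""); }
    static Status Error(const char *msg) { return Status(-1, msg); }
    bool ok() const { return code == 0; }
    const char *message() const { return msg; }   // error text available as a string
private:
    Status(int c, const char *m) : code(c), msg(m) {}
    int code;
    const char *msg;   // simplified: assumes static-lifetime message text
};

class FileSystem {
public:
    // No-throw creation: errors come back in the Status instead of an
    // exception escaping a constructor.
    static Status Create(FileSystem **fsptr);
};
{code}
A thin C wrapper could then copy the Status message into thread-local storage so that an {{hdfsGetLastError()}}-style call, as quoted earlier in the thread, can return it.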
[jira] [Commented] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169593#comment-14169593 ] Haohui Mai commented on HDFS-6544: -- +1. The tests and release audit warnings are unrelated. I'll commit it shortly. Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7236: Resolution: Fixed Fix Version/s: 2.6.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've committed this to trunk, branch-2 and branch-2.6.0. Thanks for the contribution, [~yzhangal]! Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Fix For: 2.6.0 Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {code} AND {code} 2014-10-11 12:38:28,552 WARN
[jira] [Commented] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169600#comment-14169600 ] Hudson commented on HDFS-7236: -- FAILURE: Integrated in Hadoop-trunk-Commit #6249 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6249/]) HDFS-7236. Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots. Contributed by Yongjun Zhang. (jing9: rev 98ac9f26c5b3bceb073ce444e42dc89d19132a1f) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestOpenFilesWithSnapshot.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Fix For: 2.6.0 Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at
[jira] [Updated] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7236: Affects Version/s: 2.6.0 Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Fix For: 2.6.0 Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {code} AND {code} 2014-10-11 12:38:28,552 WARN datanode.DataNode (BPServiceActor.java:offerService(751)) - RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.io.IOException): Got incremental block report from
[jira] [Updated] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7236: Target Version/s: 2.6.0 Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Fix For: 2.6.0 Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {code} AND {code} 2014-10-11 12:38:28,552 WARN datanode.DataNode (BPServiceActor.java:offerService(751)) - RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.io.IOException): Got incremental block report from unregistered
[jira] [Updated] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-6544: - Resolution: Fixed Fix Version/s: 2.6.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've committed the patch to trunk and branch-2. Thanks [~snayakm] for the contribution. Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Fix For: 2.6.0 Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7222) Expose DataNode network errors as a metric
[ https://issues.apache.org/jira/browse/HDFS-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7222: --- Attachment: HDFS-7222.001.patch Attaching some diffs for a test run. Expose DataNode network errors as a metric -- Key: HDFS-7222 URL: https://issues.apache.org/jira/browse/HDFS-7222 Project: Hadoop HDFS Issue Type: New Feature Components: datanode Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7222.001.patch It would be useful to track datanode network errors and expose them as a metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6544) Broken Link for GFS in package.html
[ https://issues.apache.org/jira/browse/HDFS-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169614#comment-14169614 ] Hudson commented on HDFS-6544: -- FAILURE: Integrated in Hadoop-trunk-Commit #6250 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6250/]) HDFS-6544. Broken Link for GFS in package.html. Contributed by Suraj Nayak M. (wheat9: rev 53100318ea20c53c4d810dedfd50b88f9f32c1dc) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/package.html Broken Link for GFS in package.html --- Key: HDFS-6544 URL: https://issues.apache.org/jira/browse/HDFS-6544 Project: Hadoop HDFS Issue Type: Bug Reporter: Suraj Nayak M Assignee: Suraj Nayak M Priority: Minor Fix For: 2.6.0 Attachments: HDFS-6544.patch The link to GFS is currently pointing to http://labs.google.com/papers/gfs.html, which is broken. Change it to http://research.google.com/archive/gfs.html which has Abstract of the GFS paper along with link to the PDF version of the GFS Paper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169617#comment-14169617 ] Allen Wittenauer commented on HDFS-7204: [~benoyantony], I promise there is a method to the madness. TL;DR: No, yes, no. Longer: In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the *-daemon.* version, you got a pid file and hadoop.root.logger is set to be INFO,(something). When a daemon is run in non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated and hadoop.root.logger is kept as INFO,console. With no pid file generated, it is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in straight up mode again. It also means that one needs to pull apart the process list to safely determine the status of the daemon since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams. In branch-3/post-HADOOP-9902, there is a slight change in the above functionality and one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode or partially, e.g. balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects. daemon=true, which is a per-script local, but is consistent across the hadoop sub-projects, tells the latter parts of the shell code that this sub-command needs to have some extra-handling enabled beyond the normal commands. In particular, daemon=true's will always get pid and out files. They will prevent two being run on the same machine by the same user simultaneously (see footnote 1, however). They get some extra options on the java command line. Etc, etc. So where does \-\-daemon come in? The value of that is stored in a global called HADOOP_DAEMON_MODE. If the user doesn't set it specifically, it defaults to 'default'. This was done to allow the code to mostly replicate the behavior of branch-2 and previous when the *-daemon.sh code was NOT used. In other words, \-\-daemon default (or no value provided) lets commands like hdfs namenode still run in the foreground, just now with pid and out files. \-\-daemon start does the disown (previously a nohup), changes the logging output from HADOOP_ROOT_LOGGER to HADOOP_DAEMON_ROOT_LOGGER, adds some extra command line options, etc, etc similar to the *-daemon.sh commands. What happens if daemon mode is set for all commands? The big thing is the pid and out file creation and the checks around it. A user would only ever be able to execute one 'hadoop fs' command at a time because of the pid file! Less than ideal. :) To summarize, daemon=true tells the code that --daemon actually means something to the sub-command. Otherwise, --daemon is ignored. 1-... unless HADOOP_IDENT_STRING is modified appropriately. This means that in branch-3, it is now possible to run two secure datanodes on the same machine as the same user, since all of the logs, pids, and outs take that into consideration! QA folks should be very happy. 
:) balancer doesn't run as a daemon Key: HDFS-7204 URL: https://issues.apache.org/jira/browse/HDFS-7204 Project: Hadoop HDFS Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Attachments: HDFS-7204-01.patch, HDFS-7204.patch From HDFS-7184, minor issues with balancer: * daemon isn't set to true in hdfs to enable daemonization * start-balancer script has usage instead of hadoop_usage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169617#comment-14169617 ] Allen Wittenauer edited comment on HDFS-7204 at 10/13/14 5:44 PM: -- [~benoyantony], I promise there is a method to the madness. TL;DR: No, yes, no. Longer: In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the *-daemon version, you got a pid file and hadoop.root.logger is set to be INFO,(something). When a daemon is run in non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated and hadoop.root.logger is kept as INFO,console. With no pid file generated, it is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in straight up mode again. It also means that one needs to pull apart the process list to determine safely determine the status of the daemon since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams. In branch-3/post-HADOOP-9902, there is a slight change in the above functionality and one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode or partially, e.g. balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects. daemon=true, which is a per-script local, but is consistent across the hadoop sub-projects, tells the latter parts of the shell code that this sub-command needs to have some extra-handling enabled beyond the normal commands. In particular, daemon=true's will always get pid and out files. They will prevent two being run on the same machine by the same user simultaneously (see footnote 1, however). They get some extra options on the java command line. Etc, etc. So where does \-\-daemon come in? The value of that is stored in a global called HADOOP_DAEMON_MODE. If the user doesn't set it specifically, it defaults to 'default'. This was done to allow the code to mostly replicate the behavior of branch-2 and previous when the *-daemon.sh code was NOT used. In other words, \-\-daemon default (or no value provided), let's commands like hdfs namenode still run in the foreground, just now with pid and out files. \-\-daemon start does the disown (previously a nohup), change the logging output from HADOOP_ROOT_LOGGER to HADOOP_DAEMON_ROOT_LOGGER, add some extra command line options, etc, etc similar to the *-daemon.sh commands. What happens if daemon mode is set for all commands? The big thing is the pid and out file creation and the checks around it. A user would only ever be able to execute one 'hadoop fs' command at a time because of the pid file! Less than ideal. :) To summarize, daemon=true tells the code that --daemon actually means something to the sub-command. Otherwise, --daemon is ignored. 1-... unless HADOOP_IDENT_STRING is modified appropriately. This means that in branch-3, it is now possible to run two secure datanodes on the same machine as the same user, since all of the logs, pids, and outs, take that into consideration! QA folks should be very happy. :) was (Author: aw): [~benoyantony], I promise there is a method to the madness. TL;DR: No, yes, no. 
Longer: In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the *-daemon.* version, you got a pid file and hadoop.root.logger is set to be INFO,(something). When a daemon is run in non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated and hadoop.root.logger is kept as INFO,console. With no pid file generated, it is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in straight up mode again. It also means that one needs to pull apart the process list to determine safely determine the status of the daemon since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams. In branch-3/post-HADOOP-9902, there is a slight change in the above functionality and one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode or partially, e.g. balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across
[jira] [Updated] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7204: --- Priority: Blocker (was: Major) balancer doesn't run as a daemon Key: HDFS-7204 URL: https://issues.apache.org/jira/browse/HDFS-7204 Project: Hadoop HDFS Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Priority: Blocker Labels: newbie Attachments: HDFS-7204-01.patch, HDFS-7204.patch From HDFS-7184, minor issues with balancer: * daemon isn't set to true in hdfs to enable daemonization * start-balancer script has usage instead of hadoop_usage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7204: --- Affects Version/s: 3.0.0 balancer doesn't run as a daemon Key: HDFS-7204 URL: https://issues.apache.org/jira/browse/HDFS-7204 Project: Hadoop HDFS Issue Type: Bug Components: scripts Affects Versions: 3.0.0 Reporter: Allen Wittenauer Assignee: Allen Wittenauer Priority: Blocker Labels: newbie Attachments: HDFS-7204-01.patch, HDFS-7204.patch From HDFS-7184, minor issues with balancer: * daemon isn't set to true in hdfs to enable daemonization * start-balancer script has usage instead of hadoop_usage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169617#comment-14169617 ] Allen Wittenauer edited comment on HDFS-7204 at 10/13/14 5:47 PM: -- [~benoyantony], I promise there is a method to the madness. TL;DR: No, yes, no. Longer: In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the *-daemon.sh version, you got a pid file and hadoop.root.logger is set to be INFO,(something). When a daemon is run in non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated and hadoop.root.logger is kept as INFO,console. With no pid file generated, it is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in straight-up mode again. It also means that one needs to pull apart the process list to safely determine the status of the daemon since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams. In branch-3/post-HADOOP-9902, there is a slight change in the above functionality, and it is one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode, or partially, e.g., balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects. daemon=true, which is a per-script local but is consistent across the hadoop sub-projects, tells the latter parts of the shell code that this sub-command needs to have some extra handling enabled beyond the normal commands. In particular, daemon=true sub-commands will always get pid and out files. They will prevent two being run on the same machine by the same user simultaneously (see footnote 1, however). They get some extra options on the java command line. Etc, etc. So where does \-\-daemon come in? The value of that is stored in a global called HADOOP_DAEMON_MODE. If the user doesn't set it specifically, it defaults to 'default'. This was done to allow the code to mostly replicate the behavior of branch-2 and previous when the *-daemon.sh code was NOT used. In other words, \-\-daemon default (or no value provided) lets commands like hdfs namenode still run in the foreground, just now with pid and out files. \-\-daemon start does the disown (previously a nohup), changes the logging output from HADOOP_ROOT_LOGGER to HADOOP_DAEMON_ROOT_LOGGER, adds some extra command line options, etc., similar to the *-daemon.sh commands. What happens if daemon mode is set for all commands? The big thing is the pid and out file creation and the checks around it. A user would only ever be able to execute one 'hadoop fs' command at a time because of the pid file! Less than ideal. :) To summarize, daemon=true tells the code that --daemon actually means something to the sub-command. Otherwise, --daemon is ignored. 1-... unless HADOOP_IDENT_STRING is modified appropriately. This means that in branch-3, it is now possible to run two secure datanodes on the same machine as the same user, since all of the logs, pids, and outs take that into consideration! QA folks should be very happy. :) was (Author: aw): [~benoyantony], I promise there is a method to the madness. TL;DR: No, yes, no.
Longer: In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the *-daemon version, you got a pid file and hadoop.root.logger is set to be INFO,(something). When a daemon is run in non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated and hadoop.root.logger is kept as INFO,console. With no pid file generated, it is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in straight up mode again. It also means that one needs to pull apart the process list to determine safely determine the status of the daemon since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams. In branch-3/post-HADOOP-9902, there is a slight change in the above functionality and one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode or partially, e.g. balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects.
[jira] [Comment Edited] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167620#comment-14167620 ] Allen Wittenauer edited comment on HDFS-7231 at 10/13/14 5:51 PM: -- Argh: another point: with namenode -finalize being taken away, this scenario is pretty much unsolvable without manual intervention and knowledge of how the NN stores stuff on disk. was (Author: aw): Argh: another point: with namenode -finalize being taken away, this scenario is pretty much unsolvable. rollingupgrade needs some guard rails - Key: HDFS-7231 URL: https://issues.apache.org/jira/browse/HDFS-7231 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer See first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7231: --- Priority: Blocker (was: Major) rollingupgrade needs some guard rails - Key: HDFS-7231 URL: https://issues.apache.org/jira/browse/HDFS-7231 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Allen Wittenauer Priority: Blocker See first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7231: --- Affects Version/s: 2.6.0 rollingupgrade needs some guard rails - Key: HDFS-7231 URL: https://issues.apache.org/jira/browse/HDFS-7231 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Allen Wittenauer See first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169634#comment-14169634 ] Allen Wittenauer commented on HDFS-7231: Verified the same crappy experience exists in the 2.6 branch. Marking this as a blocker since this will be the last release for everyone's precious JDK 1.6 support. I'd love to hear some options from the peanut gallery on how to improve this so users aren't left with a potential time bomb on their hands. Alias -upgrade to -rollingupgrade? Bring nn -finalize back? Auto-finalize? rollingupgrade needs some guard rails - Key: HDFS-7231 URL: https://issues.apache.org/jira/browse/HDFS-7231 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Allen Wittenauer Priority: Blocker See first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6884) Include the hostname in HTTPFS log filenames
[ https://issues.apache.org/jira/browse/HDFS-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-6884: --- Attachment: (was: HDFS-6884.patch) Include the hostname in HTTPFS log filenames Key: HDFS-6884 URL: https://issues.apache.org/jira/browse/HDFS-6884 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.5.0 Reporter: Andrew Wang Assignee: Alejandro Abdelnur It'd be good to include the hostname in the httpfs log filenames. Right now we have httpfs.log and httpfs-audit.log, it'd be nice to have e.g. httpfs-${hostname}.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (HDFS-6884) Include the hostname in HTTPFS log filenames
[ https://issues.apache.org/jira/browse/HDFS-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-6884: --- Comment: was deleted (was: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674259/HDFS-6884.patch against trunk revision d3d3d47. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 12 warning messages. See https://builds.apache.org/job/PreCommit-HDFS-Build/8395//artifact/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8395//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8395//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8395//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8395//console This message is automatically generated.) Include the hostname in HTTPFS log filenames Key: HDFS-6884 URL: https://issues.apache.org/jira/browse/HDFS-6884 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.5.0 Reporter: Andrew Wang Assignee: Alejandro Abdelnur It'd be good to include the hostname in the httpfs log filenames. Right now we have httpfs.log and httpfs-audit.log, it'd be nice to have e.g. httpfs-${hostname}.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6884) Include the hostname in HTTPFS log filenames
[ https://issues.apache.org/jira/browse/HDFS-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169645#comment-14169645 ] Allen Wittenauer commented on HDFS-6884: (Previous patch and QA result was deleted by request of the contributor to prevent confusion, as it was intended for a different issue.) Include the hostname in HTTPFS log filenames Key: HDFS-6884 URL: https://issues.apache.org/jira/browse/HDFS-6884 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.5.0 Reporter: Andrew Wang Assignee: Alejandro Abdelnur It'd be good to include the hostname in the httpfs log filenames. Right now we have httpfs.log and httpfs-audit.log, it'd be nice to have e.g. httpfs-${hostname}.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7236) Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots
[ https://issues.apache.org/jira/browse/HDFS-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169647#comment-14169647 ] Yongjun Zhang commented on HDFS-7236: - Many thanks [~jingzhao]! FYI, I just took a look at HDFS-7226 (TestDNFencing.testQueueingWithAppend failed often in latest test) a bit and found that it seems to be related to HDFS-7217 change too. However, it's more subtle there, and it appears to have something to do with hflush. I will look more at that jira a bit later. Fix TestOpenFilesWithSnapshot#testOpenFilesWithMultipleSnapshots Key: HDFS-7236 URL: https://issues.apache.org/jira/browse/HDFS-7236 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Fix For: 2.6.0 Attachments: HDFS-7236.001.patch Per the following report {code} Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-Hdfs-trunk THERE ARE 4 builds (out of 5) that have failed tests in the past 7 days, as listed below: ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/testReport (2014-10-11 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/testReport (2014-10-10 04:30:40) Failed test: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failed test: org.apache.hadoop.tracing.TestTracing.testReadTraceHooks Failed test: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress Failed test: org.apache.hadoop.tracing.TestTracing.testWriteTraceHooks ... Among 5 runs examined, all failed tests #failedRuns: testName: 4: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 2: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 2: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestDeadDatanode.testDeadDatanode ... {code} TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots failed in most recent two runs in trunk. 
Creating this jira for it (The other two tests that failed more often were reported in separate jira HDFS-7221 and HDFS-7226) Symptom: {code} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {code} AND {code} 2014-10-11 12:38:24,385 ERROR datanode.DataNode (DataXceiver.java:run(243)) - 127.0.0.1:55303:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:32949 dst: /127.0.0.1:55303 java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:196) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:468) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:772) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:720) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) at
[jira] [Assigned] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang reassigned HDFS-7226: --- Assignee: Yongjun Zhang TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Using tool from HADOOP-11045, got the following report: {code} [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1 Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below: .. Among 9 runs examined, all failed tests #failedRuns: testName: 7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching .. {code} TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom: {code} Failed org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec. Error Message expected:18 but was:12 Stacktrace java.lang.AssertionError: expected:18 but was:12 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7090) Use unbuffered writes when persisting in-memory replicas
[ https://issues.apache.org/jira/browse/HDFS-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7090: Resolution: Fixed Status: Resolved (was: Patch Available) Xiaoyu pointed out that the test failures are unrelated. {{TestOpenFilesWithSnapshot}} just got fixed in HDFS-7236. {{TestDNFencing}} is an existing failure tracked in HDFS-7226. I committed this to trunk. Xiaoyu, thank you for contributing the patch. Use unbuffered writes when persisting in-memory replicas Key: HDFS-7090 URL: https://issues.apache.org/jira/browse/HDFS-7090 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.6.0 Reporter: Arpit Agarwal Assignee: Xiaoyu Yao Fix For: 3.0.0 Attachments: HDFS-7090.0.patch, HDFS-7090.1.patch, HDFS-7090.2.patch, HDFS-7090.3.patch, HDFS-7090.4.patch The LazyWriter thread just uses {{FileUtils.copyFile}} to copy block files to persistent storage. It would be better to use unbuffered writes to avoid churning page cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7090) Use unbuffered writes when persisting in-memory replicas
[ https://issues.apache.org/jira/browse/HDFS-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169657#comment-14169657 ] Hudson commented on HDFS-7090: -- FAILURE: Integrated in Hadoop-trunk-Commit #6251 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6251/]) HDFS-7090. Use unbuffered writes when persisting in-memory replicas. Contributed by Xiaoyu Yao. (cnauroth: rev 1770bb942f9ebea38b6811ba0bc3cc249ef3ccbb) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/Errno.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/nativeio/TestNativeIO.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/errno_enum.c Use unbuffered writes when persisting in-memory replicas Key: HDFS-7090 URL: https://issues.apache.org/jira/browse/HDFS-7090 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.6.0 Reporter: Arpit Agarwal Assignee: Xiaoyu Yao Fix For: 3.0.0 Attachments: HDFS-7090.0.patch, HDFS-7090.1.patch, HDFS-7090.2.patch, HDFS-7090.3.patch, HDFS-7090.4.patch The LazyWriter thread just uses {{FileUtils.copyFile}} to copy block files to persistent storage. It would be better to use unbuffered writes to avoid churning page cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
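For illustration, the pattern described in HDFS-7090 is to replace the buffered {{FileUtils.copyFile}} call in the LazyWriter path with an unbuffered copy when native support is available. The sketch below is only an approximation: the helper name {{NativeIO.copyFileUnbuffered}} and the fallback behavior are assumptions made for the sketch, not a quotation of the committed change.
{code}
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.nativeio.NativeIO;

// Sketch only: persist a block file to disk without churning the page cache.
// The native helper name below is an assumption for illustration.
static void persistBlockFile(File src, File dst) throws IOException {
  if (NativeIO.isAvailable()) {
    NativeIO.copyFileUnbuffered(src, dst);  // unbuffered copy, avoids page cache churn
  } else {
    FileUtils.copyFile(src, dst);           // previous buffered behavior as a fallback
  }
}
{code}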
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169673#comment-14169673 ] Colin Patrick McCabe commented on HDFS-7235: Thanks for looking at this, Yongjun. I don't understand why we need a new function named {{FsDatasetSpi#isInvalidBlockDueToNonexistentBlockFile}}. The JavaDoc for {{FsDatasetSpi#isValid}} says that it checks if the block exist\[s\] and has the given state, and it's clear from the code that this is what it actually implements. We start by calling isValid... {code} private void transferBlock(ExtendedBlock block, DatanodeInfo[] xferTargets, StorageType[] xferTargetStorageTypes) throws IOException { BPOfferService bpos = getBPOSForBlock(block); DatanodeRegistration bpReg = getDNRegistrationForBP(block.getBlockPoolId()); if (!data.isValidBlock(block)) { // block does not exist or is under-construction String errStr = "Can't send invalid block " + block; LOG.info(errStr); bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK, errStr); return; } ... {code} {{isValid}} checks whether the block file exists... {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} So there's no need for a new function. isValid already does what you want. bq. The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Is this a problem with {{BPOfferService#trySendErrorReport}}? If so, it seems like we should fix it there. I can see that BPServiceActor#trySendErrorReport calls {{NameNodeRpc#errorReport}}, whereas your patch calls {{NameNodeRpc#reportBadBlocks}}. What's the reason for this change, and does it fix the bug described above? Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch When decommissioning a DN, the process hangs. What happens is, when NN chooses a replica as a source to replicate data on the to-be-decommissioned DN to other DNs, it favors choosing this DN to-be-decommissioned as the source of transfer (see BlockManager.java). However, because of the bad disk, the DN would detect the source block to be transferred as invalidBlock with the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} The reason that this method returns false (detecting invalid block) is that the block file doesn't exist due to bad disk in this case. 
The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
Tsz Wo Nicholas Sze created HDFS-7237: - Summary: namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169714#comment-14169714 ] Yongjun Zhang commented on HDFS-7235: - Hi Colin, Thanks a lot for the review. The key issue identified for the original symptom was, when a block is detected as invalid by the existing isValid() method, we call SendErrorReport(), which just logs a message there, and Namenode doesn't do more than logging the message for this call, so NameNode doesn't know the block is bad. What I did was, I separated the reasons for isValid() being false into two parts: - if it's false because getBlockFile().exists() returns false, call reportBadBlocks, so NameNode will record the bad block for future reference. - if it's false because either replicaInfo == null OR replicaInfo.getState() != state, it still calls SendErrorReport() like before. Actually for this case, the state has to be FINALIZED. We don't want to report a bad block for a state that's RBW, for example. If we make the change in SendErrorReport, that means we need to change the behavior of this method to also call reportBadBlocks from there conditionally, which is not clean to me, because SendErrorReport is supposed to just send an error report. Wonder if this explanation makes sense to you? Thanks. Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch When decommissioning a DN, the process hangs. What happens is, when NN chooses a replica as a source to replicate data on the to-be-decommissioned DN to other DNs, it favors choosing this DN to-be-decommissioned as the source of transfer (see BlockManager.java). However, because of the bad disk, the DN would detect the source block to be transferred as invalidBlock with the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} The reason that this method returns false (detecting invalid block) is that the block file doesn't exist due to bad disk in this case. The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
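To make the two branches concrete, the decision described above might look roughly like the following inside the DataNode transfer path. This is a sketch only, not the attached patch; the helper {{isFinalizedButBlockFileMissing}} is a hypothetical name introduced here for illustration.
{code}
// Sketch only: split the "invalid block" handling into two kinds of report.
if (!data.isValidBlock(block)) {
  if (isFinalizedButBlockFileMissing(block)) {  // hypothetical helper for the missing-file case
    // Finalized replica whose block file is gone (e.g. bad disk): tell the NN the
    // replica is corrupt so it stops picking this DN as the replication source.
    reportBadBlocks(block);
  } else {
    // No replica, or replica not FINALIZED (e.g. RBW): keep the old error report.
    bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK,
        "Can't send invalid block " + block);
  }
  return;
}
{code}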
[jira] [Commented] (HDFS-7055) Add tracing to DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169768#comment-14169768 ] Colin Patrick McCabe commented on HDFS-7055: Nicholas, I apologize if these findbugs issues inconvenienced you. I have filed HADOOP-11197 to make test-patch.sh more robust to issues like HADOOP-11178. I would appreciate a review on HDFS-7227. Thanks also to Yongjun for fixing HDFS-7194 (introduced by me) and HDFS-7169 (introduced by Nicholas). Add tracing to DFSInputStream - Key: HDFS-7055 URL: https://issues.apache.org/jira/browse/HDFS-7055 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.7.0 Attachments: HDFS-7055.002.patch, HDFS-7055.003.patch, HDFS-7055.004.patch, HDFS-7055.005.patch, screenshot-get-1mb.005.png, screenshot-get-1mb.png Add tracing to DFSInputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7237: -- Attachment: h7237_20141013.patch h7237_20141013.patch: checks the index. namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
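For reference, the kind of index check the patch describes might look roughly like the following in NameNode argument parsing. This is an illustrative sketch, not the attached h7237_20141013.patch; the exact message text and surrounding structure are assumptions.
{code}
// Sketch only: avoid indexing past the end of args when -rollingUpgrade is the last
// argument, and print a usable message instead of an ArrayIndexOutOfBoundsException.
if (StartupOption.ROLLINGUPGRADE.getName().equalsIgnoreCase(cmd)) {
  if (i + 1 >= argsLen) {
    LOG.fatal("Must specify a rolling upgrade startup option after -rollingUpgrade");
    return null;  // caller prints usage and exits
  }
  startOpt.setRollingUpgradeStartupOption(args[++i]);
}
{code}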
[jira] [Updated] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7237: -- Status: Patch Available (was: Open) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169779#comment-14169779 ] Maysam Yabandeh commented on HDFS-6982: --- Thanks [~andrew.wang] for the well-detailed review. I will submit a new patch soon. In the meanwhile, let me double check a couple of points with you. bq. Since I don't see any modifications to any existing files, I'm also wondering how this is exposed to JMX or on the webUI. You are right. I was not sure where the best place is to integrate nntop with the NN. I will pick a place and we can update it later. bq. There's only a {{getDefaultRollingWindow}} class, no other ways of constructing a RollingWindow. The design doc envisions two interfaces to access the top users. One is JMX, which requires a rolling window over only one reporting period, say 1 minute. JMX data, however, are most useful when they are integrated with an external graphing tool. To also allow users with small clusters to benefit from the data computed by nntop, we also provide an html interface, which has no graphing capability. This basic interface unfortunately does not give a sense of *trend* to the viewer. To compensate for that, the html page will show the top users over multiple time periods, say 1, 5, 25 minutes; ergo why we have multiple rolling window periods in nntop. One of them however is used for the jmx interface, which is specified by {{getDefaultRollingWindow}}. About the html interface, I excluded it from this patch for two reasons. First, I figured it is better to keep this patch as small as possible and work on the html interface patch on a separate jira. The second reason was that I had previously used the yarn html utils and I am going to have to rewrite that part using html utils which are standard to the hdfs project. bq. How do we configure multiple reporting periods? Via some conf params. I will make sure that the docs reflect that properly. bq. WEB_PORT and DEFAULT_WEB_PORT seem to be unused You are right. They are supposed to be used by the html interface, but I should remove them from this patch. bq. getCmdTotal and getTopMetricsRecordPrefix static getters are only used in TopMetrics, that might be a better home. They will later be used by the html interface as well. The html interface will show the total operations on top and then details of each command afterwards. bq. Rather than MIN_2_MS, could we have a long array with the default periods, i.e. DEFAULT_REPORTING_PERIODS? In addition to the previous explanation about multiple reporting periods for the html view, I should add that the reporting periods are expected to be specified in the conf file. I dropped the method that reads them from the conf file from the patch since it was invoked only via the html interface. But I guess I should put it back to avoid confusion. bq. report, we construct the permStr, but don't actually use it. You are right. I actually can drop src, dst, and also status. At the beginning, the vision for nntop was to also report hot directories, etc., and that is why we kept the full details in the report method. But I guess we can always put such details back if at some point those visions were to be pursued. bq. report, I don't think we need the catch for Throwable t, no checked exceptions are being thrown? The idea was that any unexpected problem from a programming bug in nntop should not crash the name node. bq. 
TopUtil: This stuff isn't shared much, seems like we could just move things to where they're used TopUtil was much fatter when it also included html view util functions. The html view will also be a user of TopUtil. bq. TopMetricsCollector: Is this used? Yeah, by the html view. I should drop it from this patch. nntop: top-like tool for name node users - Key: HDFS-6982 URL: https://issues.apache.org/jira/browse/HDFS-6982 Project: Hadoop HDFS Issue Type: New Feature Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, nntop-design-v1.pdf In this jira we motivate the need for nntop, a tool that, similarly to what top does in Linux, gives the list of top users of the HDFS name node and gives insight into which users are sending the majority of each traffic type to the name node. This information turns out to be the most critical when the name node is under pressure and the HDFS admin needs to know which user is hammering the name node and with what kind of requests. Here we present the design of nntop which has been in production at Twitter for the past 10 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K nodes), low memory footprint (less than a few MB), and is quite efficient for the write path (only two hash lookups for updating a metric).
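As background for the multiple-period discussion above, a bucketed rolling window can be kept quite small. The toy sketch below is purely illustrative (all names are hypothetical and unrelated to the attached patch): it counts events into fixed-length buckets and sums only the buckets that still fall inside the window. nntop would keep one such window per user and operation type, and one set of windows per configured reporting period, with the shortest period backing the JMX view and the longer ones backing the html view.
{code}
// Toy sketch of a fixed-bucket rolling window counter (hypothetical names).
class ToyRollingWindow {
  private final long bucketLenMs;
  private final long[] counts;
  private final long[] bucketStartMs;   // start time of the data currently held in each bucket

  ToyRollingWindow(long windowLenMs, int numBuckets) {
    this.bucketLenMs = windowLenMs / numBuckets;
    this.counts = new long[numBuckets];
    this.bucketStartMs = new long[numBuckets];
  }

  synchronized void incr(long nowMs) {
    long bucketNum = nowMs / bucketLenMs;
    int i = (int) (bucketNum % counts.length);
    long start = bucketNum * bucketLenMs;
    if (bucketStartMs[i] != start) {    // bucket still holds data from an older window: reset it
      bucketStartMs[i] = start;
      counts[i] = 0;
    }
    counts[i]++;
  }

  synchronized long sum(long nowMs) {
    long windowLenMs = (long) counts.length * bucketLenMs;
    long total = 0;
    for (int i = 0; i < counts.length; i++) {
      // Only count buckets whose data falls inside the last window.
      if (nowMs - bucketStartMs[i] < windowLenMs) {
        total += counts[i];
      }
    }
    return total;
  }
}
{code}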
[jira] [Created] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
Charles Lamb created HDFS-7238: -- Summary: TestOpenFilesWithSnapshot fails periodically with test timeout Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7238: --- Attachment: HDFS-7238.001.patch It seems that adding a timeout (120s) argument to the @Tests in the file will fix this. Attaching a patch for a jenkins run. TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
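For reference, the change being proposed is purely a JUnit annotation-level timeout rather than a functional change; with the 120s value mentioned above it would look like this (illustrative snippet, not the attached patch):
{code}
// Before: a hung MiniDFSCluster restart stalls the whole surefire fork.
@Test
public void testOpenFilesWithMultipleSnapshots() throws Exception { ... }

// After: the test is failed once it exceeds 120 seconds instead of hanging.
@Test(timeout = 120000)
public void testOpenFilesWithMultipleSnapshots() throws Exception { ... }
{code}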
[jira] [Updated] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7238: --- Status: Patch Available (was: Open) TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7222) Expose DataNode network errors as a metric
[ https://issues.apache.org/jira/browse/HDFS-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7222: --- Status: Patch Available (was: Open) Expose DataNode network errors as a metric -- Key: HDFS-7222 URL: https://issues.apache.org/jira/browse/HDFS-7222 Project: Hadoop HDFS Issue Type: New Feature Components: datanode Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7222.001.patch It would be useful to track datanode network errors and expose them as a metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
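As background for reviewers, DataNode metrics are published through the metrics2 library, so a network-error counter would typically be a mutable counter that the DataNode I/O paths increment. A rough sketch follows; the field and method names here are assumptions for illustration, not the attached HDFS-7222.001.patch.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

// Sketch only: the general metrics2 shape of a DataNode-side error counter.
@Metrics(about = "DataNode metrics", context = "dfs")
class DataNodeMetricsSketch {
  @Metric("Count of network errors seen by the datanode")
  MutableCounterLong datanodeNetworkErrors;

  void incrDatanodeNetworkErrors() {
    // Called from the error-handling paths (e.g. DataXceiver) when a
    // socket-level failure is caught and handled.
    datanodeNetworkErrors.incr();
  }
}
{code}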
[jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed.
[ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169832#comment-14169832 ] Colin Patrick McCabe commented on HDFS-7121: Sounds good. Thanks for working on this. For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed. Key: HDFS-7121 URL: https://issues.apache.org/jira/browse/HDFS-7121 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node Reporter: Chris Nauroth Assignee: Chris Nauroth Several JournalNode operations are not satisfied by a quorum. They must succeed on every JournalNode in the cluster. If the operation succeeds on some nodes, but fails on others, then this may leave the nodes in an inconsistent state and require operations to do manual recovery steps. For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then the operator will need to correct the problem on the failed node and also manually restore the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169833#comment-14169833 ] Jing Zhao commented on HDFS-7238: - Duplicate with HDFS-7236? TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7238: --- Status: Open (was: Patch Available) TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb resolved HDFS-7238. Resolution: Duplicate Duplicate of HDFS-7236 TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169850#comment-14169850 ] Colin Patrick McCabe commented on HDFS-7235: Thanks for explaining this. If I understand correctly, you want blocks that are not in a finalized state to cause {{trySendErrorReport}}, but blocks that don't exist or have the wrong length to cause {{reportBadBlocks}}. That seems reasonable. One improvement that I would suggest is that you don't need to add a new method to FsDatasetSpi to do that. Just call {{FsDatasetSpi#getLength}}. If the block doesn't exist, it will throw an IOException which you can catch. Patch looks good aside from that. Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch When decommissioning a DN, the process hangs. What happens is, when NN chooses a replica as a source to replicate data on the to-be-decommissioned DN to other DNs, it favors choosing this DN to-be-decommissioned as the source of transfer (see BlockManager.java). However, because of the bad disk, the DN would detect the source block to be transferred as invalidBlock with the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} The reason that this method returns false (detecting invalid block) is that the block file doesn't exist due to bad disk in this case. The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
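Following that suggestion, the same decision can be made without a new {{FsDatasetSpi}} method by leaning on the exception from {{getLength}}. A minimal sketch, assuming the surrounding names from the existing {{transferBlock}} code:
{code}
// Sketch only: use the existing FsDatasetSpi#getLength instead of a new method.
// getLength throws IOException when the replica/block file cannot be found.
try {
  data.getLength(block);
} catch (IOException e) {
  // Missing/unreadable block file: report it as a bad block so the NN
  // stops choosing this DN as the replication source.
  reportBadBlocks(block);
  return;
}
// Otherwise fall through to the existing trySendErrorReport handling.
{code}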
[jira] [Updated] (HDFS-6824) Additional user documentation for HDFS encryption.
[ https://issues.apache.org/jira/browse/HDFS-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-6824: -- Attachment: hdfs-6824.002.patch Thanks for reviewing Yi, good catches. New patch fixes all your comments. Additional user documentation for HDFS encryption. -- Key: HDFS-6824 URL: https://issues.apache.org/jira/browse/HDFS-6824 Project: Hadoop HDFS Issue Type: Sub-task Components: documentation Affects Versions: 2.6.0 Reporter: Andrew Wang Assignee: Andrew Wang Priority: Minor Attachments: TransparentEncryption.html, hdfs-6824.001.patch, hdfs-6824.002.patch We'd like to better document additional things about HDFS encryption: setup and configuration, using alternate access methods (namely WebHDFS and HttpFS), other misc improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7237: -- Description: Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. was: run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7237: -- Attachment: h7237_20141013b.patch new StringBuilder('') does not work well since it is using the StringBuilder(int) constructor. h7237_20141013b.patch: fixes the bug. namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch, h7237_20141013b.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
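For readers unfamiliar with the overload pitfall mentioned above, a small self-contained Java example (illustrative only; it uses a space character rather than whatever literal the original code passed): a char argument is widened to int, so the call resolves to the capacity constructor instead of producing a builder that contains the character.
{code}
public class StringBuilderCharPitfall {
  public static void main(String[] args) {
    // ' ' widens to the int 32, so this resolves to StringBuilder(int capacity):
    // an empty builder with capacity 32, not a builder containing a space.
    StringBuilder wrong = new StringBuilder(' ');
    System.out.println(wrong.length());    // prints 0
    System.out.println(wrong.capacity());  // prints 32

    // Passing a String (or appending the char) gives the intended content.
    StringBuilder right = new StringBuilder(" ");
    System.out.println(right.length());    // prints 1
  }
}
{code}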
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169882#comment-14169882 ] Colin Patrick McCabe commented on HDFS-7207: bq. A slightly simpler (probably subjective) approach might be to wrap things in the opposite way. That is, putting the error message / stack traces in the Status object directly and let hdfsGetLastError to get the string. It avoids copying the error message twice, once from the implementation to the TLS and another from TLS to the returned Status object. What do you think? I think we should definitely add {{hdfsGetLastError}} to the C API. There are a lot of applications using the C API and a lot of them are going to continue to do so for all the reasons we discussed earlier. This is an easy, 100% backwards-compatible way to add richer error messages to the API. I don't think copying an error message once is worth thinking about. Errors are (or should be) rare. The overhead of throwing an exception is much larger than copying a C string, and libhdfs3 currently throws exceptions in error cases. So this is strictly an improvement from a performance point of view. It also simplifies maintenance because we only have to worry about setting error messages at one point. And it's the only way we can add richer error messages in {{libhdfs}} and {{libwebhdfs}}. No other solution even comes close to matching the advantages of {{hdfsGetLastError}}, in my opinion. bq. If an Input / OutputStream is leaked then the corresponding FileSystem will leak. I found the paradigm in leveldb quite helpful: {code} DB *db = DB::Open(); Iterator *it = db->(...); delete db; // bails out because the iterator it has leaked. {code} bq. That might allow the user to be more aware of the leaks. Maybe we can do something similar? I might be missing something, but giving the user back a bare pointer seems strictly less useful than giving the user back a shared_ptr. As [~wangzw] pointed out, we still have to do refcounting either way, so there's no performance improvement. If the user wants shared_ptr semantics and you give them a bare pointer, they have to wrap it in another shared_ptr, adding overhead. On the other hand, if the user doesn't want shared_ptr semantics, there is no disadvantage to giving back a shared_ptr. The user can simply delete the streams, then delete the filesystem, and get the same result as with a bare pointer. If leaks are a problem in a C++ program, there are tools like valgrind, ASAN, and so forth. We use these tools a lot in Impala-- they work really well! Throwing an exception in a delete() method is not really a very robust way of detecting memory leaks. After all, the delete method itself may never be called if the programmer makes a mistake. Finally, to repeat my earlier argument, bare pointers are basically what the C interface gives back. That is the C way-- manual allocation and de-allocation. So if we're going to do the same here, it makes me wonder why we need a new interface. I guess you could argue that it allows us to detect use-after-free, but this is something that valgrind and ASAN do a great job detecting already, and without runtime overhead in production. 
libhdfs3 should not expose exceptions in public C++ API --- Key: HDFS-7207 URL: https://issues.apache.org/jira/browse/HDFS-7207 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Colin Patrick McCabe Priority: Blocker Attachments: HDFS-7207.001.patch There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6919) Enforce a single limit for RAM disk usage and replicas cached via locking
[ https://issues.apache.org/jira/browse/HDFS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169891#comment-14169891 ] Jitendra Nath Pandey commented on HDFS-6919: +1 for adding a release note for 2.6, and have it implemented in the follow on release. Enforce a single limit for RAM disk usage and replicas cached via locking - Key: HDFS-6919 URL: https://issues.apache.org/jira/browse/HDFS-6919 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Arpit Agarwal Assignee: Colin Patrick McCabe Priority: Blocker The DataNode can have a single limit for memory usage which applies to both replicas cached via CCM and replicas on RAM disk. See comments [1|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106025page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106025], [2|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106245page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106245] and [3|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106575page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106575] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169944#comment-14169944 ] Andrew Wang commented on HDFS-6982: --- Hi Maysam, It seems like without the HTML / JMX stuff, I missed out on a bunch of context in my review. As to a patch split, here's a suggestion. I believe our new HTML UI sources all of its information by using the {{/jmx}} endpoint. This is good since it means external tools can collect the same information without scraping our UIs. I think a reasonable first patch would add {{/jmx}} output, since then we'll be able to turn it on and add tests. Then, subsequent patches can add the HTML and JS for the WebUI. Alternatively, if you think it's manageable to review the entire patch, we could try giving that a go. My guess though is that the top webpage is currently not using {{/jmx}}, so the above patch split would be the fastest way to start getting things committed. nntop: top-like tool for name node users - Key: HDFS-6982 URL: https://issues.apache.org/jira/browse/HDFS-6982 Project: Hadoop HDFS Issue Type: New Feature Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, nntop-design-v1.pdf In this jira we motivate the need for nntop, a tool that, similarly to what top does in Linux, gives the list of top users of the HDFS name node and gives insight about which users are sending the majority of each traffic type to the name node. This information turns out to be the most critical when the name node is under pressure and the HDFS admin needs to know which user is hammering the name node and with what kind of requests. Here we present the design of nntop which has been in production at Twitter in the past 10 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K nodes), low memory footprint (less than a few MB), and quite efficient for the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
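Since the patch-split suggestion above hinges on exposing the nntop data through {{/jmx}}, here is a minimal, hedged example of how an external tool (or the web UI) could pull a single bean from that endpoint; the host, port, and the {{TopUsers}} bean name are placeholders, as the real bean name would be whatever the nntop patch registers.
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JmxFetch {
  public static void main(String[] args) throws Exception {
    // Placeholder NameNode HTTP address; qry= filters the output to one bean.
    URL url = new URL("http://namenode.example.com:50070/jmx"
        + "?qry=Hadoop:service=NameNode,name=TopUsers");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON payload; parse with any JSON library
      }
    }
  }
}
{code}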
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169967#comment-14169967 ] Yongjun Zhang commented on HDFS-7235: - Hi [~cmccabe], Thanks a lot for the input. Yes, I expect {{trySendErrorReport}} to be called when the blocks are not in finalized state, and {{reportBadBlocks}} to be called when the block file doesn't exist. To try what you suggested, when {{isValidBlock}} returns false, I still need to check that the other conditions are true: {code} replicaInfo != null && replicaInfo.getState() == FINALIZED {code} Right now there is no method to get replicaInfo from the DataNode.java side, except a deprecated method {code} @Deprecated public Replica getReplica(String bpid, long blockId); {code} I will just call this method if it's ok to use. Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch When trying to decommission a DN, the process hangs. What happens is, when NN chooses a replica as a source to replicate data on the to-be-decommissioned DN to other DNs, it favors choosing this DN to-be-decommissioned as the source of transfer (see BlockManager.java). However, because of the bad disk, the DN would detect the source block to be transferred as invalidBlock with the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} The reason that this method returns false (detecting invalid block) is because the block file doesn't exist due to bad disk in this case. The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7239) Create a servlet for HDFS UI
Haohui Mai created HDFS-7239: Summary: Create a servlet for HDFS UI Key: HDFS-7239 URL: https://issues.apache.org/jira/browse/HDFS-7239 Project: Hadoop HDFS Issue Type: Improvement Reporter: Haohui Mai Assignee: Haohui Mai Currently the HDFS UI gathers most of its information from JMX. There are a couple of disadvantages: * JMX is also used by management tools, thus Hadoop needs to maintain compatibility across minor releases. * JMX organizes information as key, value pairs. The organization does not fit well with emerging use cases like startup progress report and nntop. This jira proposes to introduce a new servlet in the NN for the purpose of serving information to the UI. It should be viewed as a part of the UI. There are *no* compatibility guarantees for the output of the servlet. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
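To make the proposal concrete, a bare-bones sketch of what such a UI-only servlet might look like; the class name, URL mapping, and JSON payload are purely illustrative and, per the description above, would carry no compatibility guarantee.
{code}
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical UI-only servlet; nothing here is a committed API.
public class NamenodeUiServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    resp.setContentType("application/json");
    // A real implementation would assemble this from NameNode state
    // (startup progress, nntop counters, ...); a fixed payload keeps the sketch short.
    resp.getWriter().write(
        "{\"startupProgress\":{\"phase\":\"LOADING_EDITS\",\"percentComplete\":0.42}}");
  }
}
{code}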
[jira] [Commented] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169994#comment-14169994 ] Suresh Srinivas commented on HDFS-7228: --- [~jingzhao], is this policy placing all the replicas in SSD? Instead of that, should we place only one replica in SSD and the remaining in default storage? This may be better given SSD is more expensive than disk and may not be as abundant as disk? Applications can place their computation tasks closer to SSD replica (which is possible given block location now includes storage type). Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169994#comment-14169994 ] Suresh Srinivas edited comment on HDFS-7228 at 10/13/14 9:18 PM: - [~jingzhao], is this policy placing all the replicas in SSD? Instead of that, should we place only one replica in SSD and the remaining in default storage? This may be better given SSD is more expensive than disk and is not as abundant as disk? Applications can place their computation tasks closer to SSD replica (which is possible given block location now includes storage type). was (Author: sureshms): [~jingzhao], is this policy placing all the replicas in SSD? Instead of that, should we place only one replica in SSD and the remaining in default storage? This may be better given SSD is more expensive than disk and may not be as abundant as disk? Applications can place their computation tasks closer to SSD replica (which is possible given block location now includes storage type). Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1417#comment-1417 ] Suresh Srinivas commented on HDFS-7237: --- +1 for the patch. Thanks Nicholas for fixing this. namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch, h7237_20141013b.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7239) Create a servlet for HDFS UI
[ https://issues.apache.org/jira/browse/HDFS-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170037#comment-14170037 ] Suresh Srinivas commented on HDFS-7239: --- [~wheat9], the JMX interface was introduced to dissuade users from scraping the namenode web UI. Since then, anytime a namenode web UI change is introduced, we have also added equivalent JMX interface methods/functionality. Moving the web UI to use JMX is great for ensuring that all UI-related APIs are available and maintained, and that an independent UI can be built. If we move all such future functionality to a new servlet, where does that leave the JMX interface? Create a servlet for HDFS UI Key: HDFS-7239 URL: https://issues.apache.org/jira/browse/HDFS-7239 Project: Hadoop HDFS Issue Type: Improvement Reporter: Haohui Mai Assignee: Haohui Mai Currently the HDFS UI gathers most of its information from JMX. There are a couple of disadvantages: * JMX is also used by management tools, thus Hadoop needs to maintain compatibility across minor releases. * JMX organizes information as key, value pairs. The organization does not fit well with emerging use cases like startup progress report and nntop. This jira proposes to introduce a new servlet in the NN for the purpose of serving information to the UI. It should be viewed as a part of the UI. There are *no* compatibility guarantees for the output of the servlet. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7228: Attachment: HDFS-7228.001.patch Thanks for the comments, Suresh! So in the new patch I just change the new policy to: {code} storageTypes=[SSD, DISK], creationFallbacks=[SSD, DISK], replicationFallbacks=[SSD, DISK] {code} Thus the first replica will be placed in SSD, and the remaining replicas will be on DISK. If the cluster runs out of SSD, then DISK is used for both block allocation and replica recovery. This policy also covers the scenario where DISK is unavailable (the policy falls back to SSD then), although it is usually rare in practice. Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch, HDFS-7228.001.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
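A self-contained toy model (not Hadoop's BlockStoragePolicy class) of how the {{[SSD, DISK]}} preference plus {{[SSD, DISK]}} fallbacks described above resolve replica placement; the types and method names below are invented purely for illustration.
{code}
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

public class OneSsdPolicySketch {
  enum StorageType { SSD, DISK }

  // First replica prefers SSD, the rest prefer DISK; if the preferred type is
  // unavailable, walk the fallback list in order.
  static List<StorageType> choose(int replication, EnumSet<StorageType> available) {
    StorageType[] preferred = {StorageType.SSD, StorageType.DISK};
    StorageType[] fallbacks = {StorageType.SSD, StorageType.DISK};
    List<StorageType> chosen = new ArrayList<>();
    for (int i = 0; i < replication; i++) {
      StorageType want = preferred[Math.min(i, preferred.length - 1)];
      if (available.contains(want)) {
        chosen.add(want);
        continue;
      }
      for (StorageType fb : fallbacks) {
        if (available.contains(fb)) {
          chosen.add(fb);
          break;
        }
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    System.out.println(choose(3, EnumSet.of(StorageType.SSD, StorageType.DISK))); // [SSD, DISK, DISK]
    System.out.println(choose(3, EnumSet.of(StorageType.DISK)));                  // [DISK, DISK, DISK]
  }
}
{code}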
[jira] [Commented] (HDFS-7056) Snapshot support for truncate
[ https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170060#comment-14170060 ] Jing Zhao commented on HDFS-7056: - The proposed design looks pretty good to me. I agree we can copy the entire block list to the file's snapshot copy right now. Snapshot support for truncate - Key: HDFS-7056 URL: https://issues.apache.org/jira/browse/HDFS-7056 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 3.0.0 Reporter: Konstantin Shvachko Implementation of truncate in HDFS-3107 does not allow truncating files which are in a snapshot. It is desirable to be able to truncate and still keep the old file state of the file in the snapshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li reassigned HDFS-6744: - Assignee: Siqi Li Improve decommissioning nodes and dead nodes access on the new NN webUI --- Key: HDFS-6744 URL: https://issues.apache.org/jira/browse/HDFS-6744 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Assignee: Siqi Li The new NN webUI lists live node at the top of the page, followed by dead node and decommissioning node. From admins point of view: 1. Decommissioning nodes and dead nodes are more interesting. It is better to move decommissioning nodes to the top of the page, followed by dead nodes and decommissioning nodes. 2. To find decommissioning nodes or dead nodes, the whole page that includes all nodes needs to be loaded. That could take some time for big clusters. The legacy web UI filters out the type of nodes dynamically. That seems to work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated HDFS-6744: -- Attachment: HDFS-6744.v1.patch Improve decommissioning nodes and dead nodes access on the new NN webUI --- Key: HDFS-6744 URL: https://issues.apache.org/jira/browse/HDFS-6744 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Assignee: Siqi Li Attachments: HDFS-6744.v1.patch The new NN webUI lists live node at the top of the page, followed by dead node and decommissioning node. From admins point of view: 1. Decommissioning nodes and dead nodes are more interesting. It is better to move decommissioning nodes to the top of the page, followed by dead nodes and decommissioning nodes. 2. To find decommissioning nodes or dead nodes, the whole page that includes all nodes needs to be loaded. That could take some time for big clusters. The legacy web UI filters out the type of nodes dynamically. That seems to work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated HDFS-6744: -- Status: Patch Available (was: Open) Improve decommissioning nodes and dead nodes access on the new NN webUI --- Key: HDFS-6744 URL: https://issues.apache.org/jira/browse/HDFS-6744 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Assignee: Siqi Li Attachments: HDFS-6744.v1.patch The new NN webUI lists live node at the top of the page, followed by dead node and decommissioning node. From admins point of view: 1. Decommissioning nodes and dead nodes are more interesting. It is better to move decommissioning nodes to the top of the page, followed by dead nodes and decommissioning nodes. 2. To find decommissioning nodes or dead nodes, the whole page that includes all nodes needs to be loaded. That could take some time for big clusters. The legacy web UI filters out the type of nodes dynamically. That seems to work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170092#comment-14170092 ] Hadoop QA commented on HDFS-7237: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674564/h7237_20141013.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestRenameWhileOpen {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8405//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8405//console This message is automatically generated. namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch, h7237_20141013b.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7222) Expose DataNode network errors as a metric
[ https://issues.apache.org/jira/browse/HDFS-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170093#comment-14170093 ] Hadoop QA commented on HDFS-7222: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674545/HDFS-7222.001.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestRenameWhileOpen {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8406//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8406//console This message is automatically generated. Expose DataNode network errors as a metric -- Key: HDFS-7222 URL: https://issues.apache.org/jira/browse/HDFS-7222 Project: Hadoop HDFS Issue Type: New Feature Components: datanode Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7222.001.patch It would be useful to track datanode network errors and expose them as a metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6824) Additional user documentation for HDFS encryption.
[ https://issues.apache.org/jira/browse/HDFS-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170132#comment-14170132 ] Hadoop QA commented on HDFS-6824: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674577/hdfs-6824.002.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8409//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8409//console This message is automatically generated. Additional user documentation for HDFS encryption. -- Key: HDFS-6824 URL: https://issues.apache.org/jira/browse/HDFS-6824 Project: Hadoop HDFS Issue Type: Sub-task Components: documentation Affects Versions: 2.6.0 Reporter: Andrew Wang Assignee: Andrew Wang Priority: Minor Attachments: TransparentEncryption.html, hdfs-6824.001.patch, hdfs-6824.002.patch We'd like to better document additional things about HDFS encryption: setup and configuration, using alternate access methods (namely WebHDFS and HttpFS), other misc improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7142) Implement a 2Q eviction strategy for HDFS-6581
[ https://issues.apache.org/jira/browse/HDFS-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170134#comment-14170134 ] Colin Patrick McCabe commented on HDFS-7142: bq. The call to ceiling need not return the exact match. If you get a non-null result you need a check for blockId and bpid match. Perhaps I misunderstood the intention. You're right... I need to check to make sure that the bpid and block id are the same after getting back a result from {{ceiling}}. Fixed. bq. dequeueNextReplicaToPersist appears to have a starvation issue. If replicas in a higher blockPoolId keep getting added constantly a replica in a lower bpid may wait indefinitely to get persisted. It would be good to persist replicas in the same order in which they were originally added, you can do that with an additional set. My mistake here was looking at the lowest value in {{replicasSortedByBlockPoolAndId}}. It should be looking at the lowest value in {{replicasSortedByTierAndLastUsed}}. There's no starvation issue if it looks at the set which is sorted by lastUsed, because oldest replicas will get picked first. Fixed. bq. Same issue with numReplicasNotPersisted, it should not count all replicas in RAM. Let me clarify the documentation on RamDiskReplicaTracker.dequeueNextReplicaToPersist. It seems to me that the number of replicas not persisted *is* all the replicas in RAM. So perhaps the function needs to be renamed. Can you clarify what this should count? bq. Colin Patrick McCabe, any comments and updates to the patch? Let me repost a patch fixing the first two issues pointing out. I will wait for clarification on any other API issues. I'm going to mark this as targetting 2.7. Implement a 2Q eviction strategy for HDFS-6581 -- Key: HDFS-7142 URL: https://issues.apache.org/jira/browse/HDFS-7142 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: 0002-Add-RamDiskReplica2QTracker.patch We should implement a 2Q or approximate 2Q eviction strategy for HDFS-6581. It is well known that LRU is a poor fit for scanning workloads, which HDFS may often encounter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7142) Implement a 2Q eviction strategy for HDFS-6581
[ https://issues.apache.org/jira/browse/HDFS-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170134#comment-14170134 ] Colin Patrick McCabe edited comment on HDFS-7142 at 10/13/14 10:35 PM: --- bq. The call to ceiling need not return the exact match. If you get a non-null result you need a check for blockId and bpid match. Perhaps I misunderstood the intention. You're right... I need to check to make sure that the bpid and block id are the same after getting back a result from {{ceiling}}. Fixed. bq. dequeueNextReplicaToPersist appears to have a starvation issue. If replicas in a higher blockPoolId keep getting added constantly a replica in a lower bpid may wait indefinitely to get persisted. It would be good to persist replicas in the same order in which they were originally added, you can do that with an additional set. My mistake here was looking at the lowest value in {{replicasSortedByBlockPoolAndId}}. It should be looking at the lowest value in {{replicasSortedByTierAndLastUsed}}. There's no starvation issue if it looks at the set which is sorted by lastUsed, because oldest replicas will get picked first. Fixed. bq. Same issue with numReplicasNotPersisted, it should not count all replicas in RAM. Let me clarify the documentation on RamDiskReplicaTracker.dequeueNextReplicaToPersist. It seems to me that the number of replicas not persisted *is* all the replicas in RAM. So perhaps the function needs to be renamed. Can you clarify what this should count? bq. Colin Patrick McCabe, any comments and updates to the patch? Let me repost a patch fixing the first two issues pointed out. I will wait for clarification on any other API issues. I'm going to mark this as targetting 2.7. was (Author: cmccabe): bq. The call to ceiling need not return the exact match. If you get a non-null result you need a check for blockId and bpid match. Perhaps I misunderstood the intention. You're right... I need to check to make sure that the bpid and block id are the same after getting back a result from {{ceiling}}. Fixed. bq. dequeueNextReplicaToPersist appears to have a starvation issue. If replicas in a higher blockPoolId keep getting added constantly a replica in a lower bpid may wait indefinitely to get persisted. It would be good to persist replicas in the same order in which they were originally added, you can do that with an additional set. My mistake here was looking at the lowest value in {{replicasSortedByBlockPoolAndId}}. It should be looking at the lowest value in {{replicasSortedByTierAndLastUsed}}. There's no starvation issue if it looks at the set which is sorted by lastUsed, because oldest replicas will get picked first. Fixed. bq. Same issue with numReplicasNotPersisted, it should not count all replicas in RAM. Let me clarify the documentation on RamDiskReplicaTracker.dequeueNextReplicaToPersist. It seems to me that the number of replicas not persisted *is* all the replicas in RAM. So perhaps the function needs to be renamed. Can you clarify what this should count? bq. Colin Patrick McCabe, any comments and updates to the patch? Let me repost a patch fixing the first two issues pointing out. I will wait for clarification on any other API issues. I'm going to mark this as targetting 2.7. 
Implement a 2Q eviction strategy for HDFS-6581 -- Key: HDFS-7142 URL: https://issues.apache.org/jira/browse/HDFS-7142 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: 0002-Add-RamDiskReplica2QTracker.patch We should implement a 2Q or approximate 2Q eviction strategy for HDFS-6581. It is well known that LRU is a poor fit for scanning workloads, which HDFS may often encounter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7142) Implement a 2Q eviction strategy for HDFS-6581
[ https://issues.apache.org/jira/browse/HDFS-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7142: --- Attachment: HDFS-7142.003.patch Implement a 2Q eviction strategy for HDFS-6581 -- Key: HDFS-7142 URL: https://issues.apache.org/jira/browse/HDFS-7142 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: 0002-Add-RamDiskReplica2QTracker.patch, HDFS-7142.003.patch We should implement a 2Q or approximate 2Q eviction strategy for HDFS-6581. It is well known that LRU is a poor fit for scanning workloads, which HDFS may often encounter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7238) TestOpenFilesWithSnapshot fails periodically with test timeout
[ https://issues.apache.org/jira/browse/HDFS-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170138#comment-14170138 ] Hadoop QA commented on HDFS-7238: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674573/HDFS-7238.001.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8407//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8407//console This message is automatically generated. TestOpenFilesWithSnapshot fails periodically with test timeout -- Key: HDFS-7238 URL: https://issues.apache.org/jira/browse/HDFS-7238 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7238.001.patch TestOpenFilesWithSnapshot fails periodically with this: {noformat} Error Message Timed out waiting for Mini HDFS Cluster to start Stacktrace java.io.IOException: Timed out waiting for Mini HDFS Cluster to start at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1789) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.doTestMultipleSnapshots(TestOpenFilesWithSnapshot.java:184) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots(TestOpenFilesWithSnapshot.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7142) Implement a 2Q eviction strategy for HDFS-6581
[ https://issues.apache.org/jira/browse/HDFS-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7142: --- Status: Patch Available (was: In Progress) Implement a 2Q eviction strategy for HDFS-6581 -- Key: HDFS-7142 URL: https://issues.apache.org/jira/browse/HDFS-7142 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: 0002-Add-RamDiskReplica2QTracker.patch, HDFS-7142.003.patch We should implement a 2Q or approximate 2Q eviction strategy for HDFS-6581. It is well known that LRU is a poor fit for scanning workloads, which HDFS may often encounter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7142) Implement a 2Q eviction strategy for HDFS-6581
[ https://issues.apache.org/jira/browse/HDFS-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7142: --- Target Version/s: 2.7.0 Affects Version/s: 2.7.0 Implement a 2Q eviction strategy for HDFS-6581 -- Key: HDFS-7142 URL: https://issues.apache.org/jira/browse/HDFS-7142 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: 0002-Add-RamDiskReplica2QTracker.patch, HDFS-7142.003.patch We should implement a 2Q or approximate 2Q eviction strategy for HDFS-6581. It is well known that LRU is a poor fit for scanning workloads, which HDFS may often encounter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170174#comment-14170174 ] Hadoop QA commented on HDFS-6744: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674616/HDFS-6744.v1.patch against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8411//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8411//console This message is automatically generated. Improve decommissioning nodes and dead nodes access on the new NN webUI --- Key: HDFS-6744 URL: https://issues.apache.org/jira/browse/HDFS-6744 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Assignee: Siqi Li Attachments: HDFS-6744.v1.patch The new NN webUI lists live node at the top of the page, followed by dead node and decommissioning node. From admins point of view: 1. Decommissioning nodes and dead nodes are more interesting. It is better to move decommissioning nodes to the top of the page, followed by dead nodes and decommissioning nodes. 2. To find decommissioning nodes or dead nodes, the whole page that includes all nodes needs to be loaded. That could take some time for big clusters. The legacy web UI filters out the type of nodes dynamically. That seems to work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
[ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated HDFS-7208: -- Assignee: Ming Ma Status: Patch Available (was: Open) NN doesn't schedule replication when a DN storage fails --- Key: HDFS-7208 URL: https://issues.apache.org/jira/browse/HDFS-7208 Project: Hadoop HDFS Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7208.patch We found the following problem. When a storage device on a DN fails, NN continues to believe replicas of those blocks on that storage are valid and doesn't schedule replication. A DN has 12 storage disks. So there is one blockReport for each storage. When a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given dfs.datanode.failed.volumes.tolerated is configured to be 0, NN still considers that DN healthy. 1. A disk failed. All blocks of that disk are removed from DN dataset. {noformat} 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current {noformat} 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN remove the DN and the replicas from the BlocksMap. In addition, blockReport doesn't provide the diff given that is done per storage. {noformat} 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current {noformat} 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
[ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated HDFS-7208: -- Attachment: HDFS-7208.patch Here is the initial patch based on heartbeat notification approach, the assumption is DN will report all healthy storages in the heartbeat. This approach is simpler than the blockReport approach which needs to have DN persist the info to cover some failure scenarios. It also makes storage failure detection faster. 1. NN detects failed storages during HB processing based on the delta between DN's reported healthy storages and the storages NN has. Marked the state of those missing storages DatanodeStorage.State.FAILED. 2. HeartbeatManager will remove blocks on those DatanodeStorage.State.FAILED storages. This will cover some corner scenarios where new replicas might be added to BlocksMap afterwards. 3. It also covers the case where admins reduce the number of healthy volumes on DN and restart DN. NN doesn't schedule replication when a DN storage fails --- Key: HDFS-7208 URL: https://issues.apache.org/jira/browse/HDFS-7208 Project: Hadoop HDFS Issue Type: Bug Reporter: Ming Ma Attachments: HDFS-7208.patch We found the following problem. When a storage device on a DN fails, NN continues to believe replicas of those blocks on that storage are valid and doesn't schedule replication. A DN has 12 storage disks. So there is one blockReport for each storage. When a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given dfs.datanode.failed.volumes.tolerated is configured to be 0, NN still considers that DN healthy. 1. A disk failed. All blocks of that disk are removed from DN dataset. {noformat} 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current {noformat} 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN remove the DN and the replicas from the BlocksMap. In addition, blockReport doesn't provide the diff given that is done per storage. {noformat} 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current {noformat} 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
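A compressed, illustrative sketch of the delta computation in point 1 above; the map and method names are simplified stand-ins, not the DatanodeDescriptor/HeartbeatManager code touched by the attached patch.
{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: compare the storage IDs a DN reports in a heartbeat against the
// storages the NN already tracks, and mark the missing ones FAILED so their
// blocks can be removed and re-replicated.
public class FailedStorageDetection {
  enum State { NORMAL, FAILED }

  private final Map<String, State> knownStorages = new HashMap<>();

  void processHeartbeat(Set<String> reportedStorageIds) {
    for (Map.Entry<String, State> e : knownStorages.entrySet()) {
      if (e.getValue() == State.NORMAL && !reportedStorageIds.contains(e.getKey())) {
        e.setValue(State.FAILED);
        System.out.println("storage " + e.getKey() + " marked FAILED");
      }
    }
    for (String id : reportedStorageIds) {
      knownStorages.putIfAbsent(id, State.NORMAL);
    }
  }

  public static void main(String[] args) {
    FailedStorageDetection nn = new FailedStorageDetection();
    nn.processHeartbeat(new HashSet<>(Arrays.asList("DS-1", "DS-2", "DS-3")));
    nn.processHeartbeat(new HashSet<>(Arrays.asList("DS-1", "DS-3"))); // DS-2 disappeared
  }
}
{code}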
[jira] [Commented] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170184#comment-14170184 ] Hadoop QA commented on HDFS-7237: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674581/h7237_20141013b.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8408//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8408//console This message is automatically generated. namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch, h7237_20141013b.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6745) Display the list of very-under-replicated blocks as well as the files on NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170203#comment-14170203 ] Ming Ma commented on HDFS-6745: --- At the RPC layer, we can add a new method similar to ClientProtocol.listCorruptFileBlocks. Maybe the new method can take a replication threshold as a parameter to retrieve all blocks below that threshold. Then ClientProtocol.listCorruptFileBlocks could become a special case of the new method. Display the list of very-under-replicated blocks as well as the files on NN webUI --- Key: HDFS-6745 URL: https://issues.apache.org/jira/browse/HDFS-6745 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Sometimes admins want to know the list of very-under-replicated blocks before major actions such as decommission, as these blocks are more likely to turn into missing blocks. Very-under-replicated blocks are those blocks with a live replica count of 1 and a replication factor of >= 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
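To sketch what the generalized RPC could look like (hedged: {{listCorruptFileBlocks}} is the existing ClientProtocol method, but the method name, parameters, and result type below are hypothetical and not part of any patch):
{code}
import java.io.IOException;

// Hypothetical interface; only listCorruptFileBlocks exists in ClientProtocol today.
public interface UnderReplicatedReporting {
  /**
   * Return files that have at least one block whose live replica count is
   * strictly below the given threshold, resuming from the given cookie.
   * A threshold of 1 reduces to the existing listCorruptFileBlocks behavior.
   */
  UnderReplicatedFileBlocks listUnderReplicatedFileBlocks(
      String path, int liveReplicaThreshold, String cookie) throws IOException;

  /** Placeholder result type, mirroring the shape of CorruptFileBlocks. */
  final class UnderReplicatedFileBlocks {
    public final String[] files;
    public final String cookie;
    public UnderReplicatedFileBlocks(String[] files, String cookie) {
      this.files = files;
      this.cookie = cookie;
    }
  }
}
{code}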
[jira] [Updated] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7235: Attachment: HDFS-7235.002.patch Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch When to decommission a DN, the process hangs. What happens is, when NN chooses a replica as a source to replicate data on the to-be-decommissioned DN to other DNs, it favors choosing this DN to-be-decommissioned as the source of transfer (see BlockManager.java). However, because of the bad disk, the DN would detect the source block to be transfered as invalidBlock with the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null replicaInfo.getState() == state replicaInfo.getBlockFile().exists(); } {code} The reason that this method returns false (detecting invalid block) is because the block file doesn't exist due to bad disk in this case. The key issue we found here is, after DN detects an invalid block for the above reason, it doesn't report the invalid block back to NN, thus NN doesn't know that the block is corrupted, and keeps sending the data transfer request to the same DN to be decommissioned, again and again. This caused an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7165) Separate block metrics for files with replication count 1
[ https://issues.apache.org/jira/browse/HDFS-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170237#comment-14170237 ] Andrew Wang commented on HDFS-7165: --- Hi Zhe, generally looks good. A few review comments: * Need javadoc on new MXBean method * For naming, you could try MissingBlocksWithReplOne as a shorter alternative to Phil's suggestion. * One whitespace-only change in TestMissingBlocksAlert * UnderReplicatedBlocks looks like a standalone class, so we happily might be able to write some actual unit tests. TestUnderReplicatedBlockQueues has an example. Would be good to test remove and update in addition to test, this will simulate block deletion and setrep (up and down). * TestUnderReplicatedBlockQueues also does something lazy and extends Assert rather than doing the static imports, it'd be cool to fix this up too if you edit this file. Separate block metrics for files with replication count 1 - Key: HDFS-7165 URL: https://issues.apache.org/jira/browse/HDFS-7165 Project: Hadoop HDFS Issue Type: Improvement Reporter: Andrew Wang Assignee: Zhe Zhang Attachments: HDFS-7165-20141003-v1.patch, HDFS-7165-20141009-v1.patch, HDFS-7165-20141010-v1.patch We see a lot of escalations because someone has written teragen output with a replication factor of 1, a DN goes down, and a bunch of missing blocks show up. These are normally false positives, since teragen output is disposable, and generally speaking, users should understand this is true for all repl=1 files. It'd be nice to be able to separate out these repl=1 missing blocks from missing blocks with higher replication factors.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7240) Object store in HDFS
Jitendra Nath Pandey created HDFS-7240: -- Summary: Object store in HDFS Key: HDFS-7240 URL: https://issues.apache.org/jira/browse/HDFS-7240 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jitendra Nath Pandey Assignee: Jitendra Nath Pandey This jira proposes to add object store capabilities into HDFS. As part of the federation work (HDFS-1052) we separated block storage as a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer i.e. datanodes. In this jira I will explore building an object store using the datanode storage, but independent of namespace metadata. I will soon update with a detailed design document. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170238#comment-14170238 ] Yongjun Zhang commented on HDFS-7235: - Hi [~cmccabe], Thanks for your earlier review. I just uploaded a new rev (002) per what you suggested. There is one issue with this approach: the changed code in DataNode now kind of sees into the FsDatasetImpl implementation. But maybe it's fine. BTW, since I need to get the replicaInfo in DataNode, and I need to make sure the replica state is FINALIZED, I simply called the exists() method to check block file existence. Thanks. Can not decommission DN which has invalid block due to bad disk --- Key: HDFS-7235 URL: https://issues.apache.org/jira/browse/HDFS-7235 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch When decommissioning a DN, the process hangs. What happens is, when the NN chooses a replica as a source to replicate the data on the to-be-decommissioned DN to other DNs, it favors choosing this to-be-decommissioned DN as the source of the transfer (see BlockManager.java). However, because of the bad disk, the DN detects the source block to be transferred as an invalid block, via the following logic in FsDatasetImpl.java: {code} /** Does the block exist and have the given state? */ private boolean isValid(final ExtendedBlock b, final ReplicaState state) { final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock()); return replicaInfo != null && replicaInfo.getState() == state && replicaInfo.getBlockFile().exists(); } {code} The reason this method returns false (detecting an invalid block) is that the block file doesn't exist, due to the bad disk in this case. The key issue we found is that after the DN detects an invalid block for the above reason, it doesn't report the invalid block back to the NN, so the NN doesn't know that the block is corrupted and keeps sending the data transfer request to the same to-be-decommissioned DN, again and again. This causes an infinite loop, so the decommission process hangs. Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
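Not the attached patch, just a hedged sketch of the general idea discussed here: when the DN finds the block file missing while preparing the transfer, it reports the replica back to the NN as a bad block instead of failing silently, so the NN stops picking the same source. DataNode#reportBadBlocks(ExtendedBlock) exists in 2.x; the surrounding wiring and the existence-check parameter are illustrative:
{code}
// Hedged sketch, not the committed change. The existence check is passed in
// here as a boolean to stand in for whatever check the patch performs.
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

class TransferSourceCheckSketch {
  static void checkBeforeTransfer(DataNode dn, ExtendedBlock block,
      boolean blockFileExists) throws IOException {
    if (!blockFileExists) {
      // Report the replica as bad so the NN marks it corrupt and stops
      // re-sending the transfer request to this (to-be-decommissioned) DN.
      dn.reportBadBlocks(block);
      throw new IOException("Can't send invalid block " + block);
    }
  }
}
{code}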
[jira] [Reopened] (HDFS-7215) Add gc log to NFS gateway
[ https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li reopened HDFS-7215: -- Add gc log to NFS gateway - Key: HDFS-7215 URL: https://issues.apache.org/jira/browse/HDFS-7215 Project: Hadoop HDFS Issue Type: Improvement Components: nfs Reporter: Brandon Li Assignee: Brandon Li Like NN/DN, a GC log would help debug issues in NFS gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7215) Add gc log to NFS gateway
[ https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170252#comment-14170252 ] Brandon Li commented on HDFS-7215: -- Thanks, [~cmccabe]. I reopened the JIRA to add JvmPauseMonitor. Will also update the user guide for HADOOP_NFS3_OPTS. Add gc log to NFS gateway - Key: HDFS-7215 URL: https://issues.apache.org/jira/browse/HDFS-7215 Project: Hadoop HDFS Issue Type: Improvement Components: nfs Reporter: Brandon Li Assignee: Brandon Li Like NN/DN, a GC log would help debug issues in NFS gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
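For reference, a minimal sketch of wiring a JvmPauseMonitor into a daemon's startup/shutdown path, the way the NN/DN do. The Configuration-taking constructor shown is the 2.x-era API (later releases changed the lifecycle), and the placement inside the NFS gateway is assumed, not taken from the actual change:
{code}
// Hedged sketch: 2.x-era JvmPauseMonitor API; gateway placement assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.JvmPauseMonitor;

class GatewayJvmMonitoringSketch {
  private JvmPauseMonitor pauseMonitor;

  void startMonitoring(Configuration conf) {
    // Logs a warning whenever the JVM pauses longer than the configured threshold
    // (e.g. long GC pauses), which is what makes GC problems visible in the logs.
    pauseMonitor = new JvmPauseMonitor(conf);
    pauseMonitor.start();
  }

  void stopMonitoring() {
    if (pauseMonitor != null) {
      pauseMonitor.stop();
    }
  }
}
{code}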
[jira] [Commented] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170304#comment-14170304 ] Hadoop QA commented on HDFS-7228: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674609/HDFS-7228.001.patch against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8410//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8410//console This message is automatically generated. Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch, HDFS-7228.001.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7230) Support rolling downgrade
[ https://issues.apache.org/jira/browse/HDFS-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170306#comment-14170306 ] Tsz Wo Nicholas Sze commented on HDFS-7230: --- As with a regular downgrade, a rolling downgrade requires the same NAMENODE_LAYOUT_VERSION and the same DATANODE_LAYOUT_VERSION. Although there is no layout change, a cluster may not be downgradable using the same rolling upgrade procedure, since protocols may change in a backward-compatible but not forward-compatible manner; i.e. old DNs can talk to the new NNs, but new DNs may not be able to talk to the old NNs. Support rolling downgrade - Key: HDFS-7230 URL: https://issues.apache.org/jira/browse/HDFS-7230 Project: Hadoop HDFS Issue Type: Improvement Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze HDFS-5535 made a lot of improvement on rolling upgrade. It also added the cluster downgrade feature. However, the downgrade described in HDFS-5535 requires cluster downtime. In this JIRA, we discuss how to do rolling downgrade, i.e. downgrade without downtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170321#comment-14170321 ] Tsz Wo Nicholas Sze commented on HDFS-7228: --- - I think we need both all-SSD and one-SSD policies. All-SSD is useful for high-performance applications. - Since the storage policy feature is not released yet, let's renumber the policy IDs. Otherwise, the upper IDs will all be used up. - Do you also want to add constants for the IDs? Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch, HDFS-7228.001.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
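For illustration only, roughly what an all-SSD/one-SSD pair with ID constants could look like. The BlockStoragePolicy constructor arguments (id, name, storage types, creation fallbacks, replication fallbacks) match the 2.6-era class as understood here, but the IDs, names, fallback choices, and the StorageType package should all be taken from the actual patch:
{code}
// Hedged sketch -- IDs and names assumed, not from the committed HDFS-7228 patch.
// (StorageType is under org.apache.hadoop.hdfs in 2.6; it moved in later releases.)
import org.apache.hadoop.hdfs.StorageType;
import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;

class SsdPolicySketch {
  static final byte ALLSSD_STORAGE_POLICY_ID = 12;  // assumed
  static final byte ONESSD_STORAGE_POLICY_ID = 10;  // assumed

  static BlockStoragePolicy allSsd() {
    // All replicas on SSD; fall back to DISK when SSD space is unavailable.
    return new BlockStoragePolicy(ALLSSD_STORAGE_POLICY_ID, "ALL_SSD",
        new StorageType[]{StorageType.SSD},
        new StorageType[]{StorageType.DISK},
        new StorageType[]{StorageType.DISK});
  }

  static BlockStoragePolicy oneSsd() {
    // One replica on SSD, the remaining replicas on DISK.
    return new BlockStoragePolicy(ONESSD_STORAGE_POLICY_ID, "ONE_SSD",
        new StorageType[]{StorageType.SSD, StorageType.DISK},
        new StorageType[]{StorageType.SSD, StorageType.DISK},
        new StorageType[]{StorageType.SSD, StorageType.DISK});
  }
}
{code}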
[jira] [Commented] (HDFS-6824) Additional user documentation for HDFS encryption.
[ https://issues.apache.org/jira/browse/HDFS-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170323#comment-14170323 ] Yi Liu commented on HDFS-6824: -- Thanks Andrew for updating the patch. You are right; I see that for the second comment you fixed it as {{HDFS user will not have access to unencrypted encryption keys}}. Additional user documentation for HDFS encryption. -- Key: HDFS-6824 URL: https://issues.apache.org/jira/browse/HDFS-6824 Project: Hadoop HDFS Issue Type: Sub-task Components: documentation Affects Versions: 2.6.0 Reporter: Andrew Wang Assignee: Andrew Wang Priority: Minor Attachments: TransparentEncryption.html, hdfs-6824.001.patch, hdfs-6824.002.patch We'd like to better document additional things about HDFS encryption: setup and configuration, using alternate access methods (namely WebHDFS and HttpFS), other misc improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7230) Support rolling downgrade
[ https://issues.apache.org/jira/browse/HDFS-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170329#comment-14170329 ] Tsz Wo Nicholas Sze commented on HDFS-7230: --- Here is the Rolling Downgrade procedure. Suppose a rolling upgrade is in progress in an HA cluster. # Downgrade DNs ## Choose a small subset of datanodes (e.g. all datanodes under a particular rack). ### Run hdfs dfsadmin -shutdownDatanode DATANODE_HOST:IPC_PORT upgrade to shut down one of the chosen datanodes. ### Run hdfs dfsadmin -getDatanodeInfo DATANODE_HOST:IPC_PORT to check and wait for the datanode to shut down. ### Downgrade and restart the datanode. ### Perform the above steps for all the chosen datanodes in the subset in parallel. ## Repeat the above steps until all datanodes in the cluster are downgraded. # Downgrade Active and Standby NNs: NN1 is active and NN2 is standby. ## Shut down and downgrade NN2. ## Start NN2 as standby (the “-rollingUpgrade downgrade” option is not needed). ## Fail over from NN1 to NN2 so that NN2 becomes active and NN1 becomes standby. ## Shut down and downgrade NN1. ## Start NN1 as standby (the “-rollingUpgrade downgrade” option is not needed). # Finalize ## Run hdfs dfsadmin -rollingUpgrade finalize to finalize the procedure. Support rolling downgrade - Key: HDFS-7230 URL: https://issues.apache.org/jira/browse/HDFS-7230 Project: Hadoop HDFS Issue Type: Improvement Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze HDFS-5535 made a lot of improvement on rolling upgrade. It also added the cluster downgrade feature. However, the downgrade described in HDFS-5535 requires cluster downtime. In this JIRA, we discuss how to do rolling downgrade, i.e. downgrade without downtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
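Pulling the commands out of the flattened procedure above for readability (same commands as stated there; DATANODE_HOST:IPC_PORT is a placeholder):
{noformat}
# For each chosen datanode in the subset:
hdfs dfsadmin -shutdownDatanode DATANODE_HOST:IPC_PORT upgrade
hdfs dfsadmin -getDatanodeInfo DATANODE_HOST:IPC_PORT   # repeat until the datanode no longer responds
# ...then downgrade the software on that datanode and restart it.

# After all DNs and both NNs have been downgraded:
hdfs dfsadmin -rollingUpgrade finalize
{noformat}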
[jira] [Updated] (HDFS-7228) Add an SSD policy into the default BlockStoragePolicySuite
[ https://issues.apache.org/jira/browse/HDFS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-7228: Attachment: HDFS-7228.002.patch Thanks for the review, Nicholas! Update the patch to address your comments. Add an SSD policy into the default BlockStoragePolicySuite -- Key: HDFS-7228 URL: https://issues.apache.org/jira/browse/HDFS-7228 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-7228.000.patch, HDFS-7228.001.patch, HDFS-7228.002.patch Currently in the default BlockStoragePolicySuite, we've defined 4 storage policies: LAZY_PERSIST, HOT, WARM, and COLD. Since we have already defined the SSD storage type, it will be useful to also include a SSD related storage policy in the default suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7237) namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170349#comment-14170349 ] Hudson commented on HDFS-7237: -- FAILURE: Integrated in Hadoop-trunk-Commit #6257 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6257/]) HDFS-7237. The command hdfs namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException. (szetszwo: rev f6d0b8892ab116514fd031a61441141ac3bdfeb5) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestHdfsServerConstants.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeOptionParsing.java namenode -rollingUpgrade throws ArrayIndexOutOfBoundsException -- Key: HDFS-7237 URL: https://issues.apache.org/jira/browse/HDFS-7237 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Priority: Minor Attachments: h7237_20141013.patch, h7237_20141013b.patch Run hdfs namenode -rollingUpgrade {noformat} 14/10/13 11:30:50 INFO namenode.NameNode: createNameNode [-rollingUpgrade] 14/10/13 11:30:50 FATAL namenode.NameNode: Exception in namenode join java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.hdfs.server.namenode.NameNode.parseArguments(NameNode.java:1252) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1367) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1501) 14/10/13 11:30:50 INFO util.ExitUtil: Exiting with status 1 {noformat} Although the command is illegal (missing rolling upgrade startup option), it should print a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
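To make the failure mode concrete: the parser reads the value following -rollingUpgrade without first checking that one exists, so a bare -rollingUpgrade walks off the end of the argument array. A self-contained, hedged illustration of the kind of bounds check needed (this is a standalone demo with made-up names, not the actual NameNode.parseArguments change):
{code}
// Illustration only; not the committed fix.
public final class RollingUpgradeArgCheckDemo {
  public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
      if ("-rollingUpgrade".equalsIgnoreCase(args[i])) {
        if (i + 1 >= args.length) {
          // Previously the code did the equivalent of args[++i] here,
          // which throws ArrayIndexOutOfBoundsException.
          System.err.println("Must specify a rolling upgrade startup option "
              + "(rollback, downgrade or started)");
          System.exit(1);
        }
        System.out.println("rolling upgrade option: " + args[i + 1]);
        i++;
      }
    }
  }
}
{code}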