[jira] [Updated] (HDFS-6606) Optimize HDFS Encrypted Transport performance
[ https://issues.apache.org/jira/browse/HDFS-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated HDFS-6606: - Attachment: HDFS-6606.006.patch Rebased the patch against the latest trunk again. Optimize HDFS Encrypted Transport performance - Key: HDFS-6606 URL: https://issues.apache.org/jira/browse/HDFS-6606 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, hdfs-client, security Reporter: Yi Liu Assignee: Yi Liu Attachments: HDFS-6606.001.patch, HDFS-6606.002.patch, HDFS-6606.003.patch, HDFS-6606.004.patch, HDFS-6606.005.patch, HDFS-6606.006.patch, OptimizeHdfsEncryptedTransportperformance.pdf In HDFS-3637, [~atm] added support for encrypting the DataTransferProtocol; it was great work. It uses the SASL {{Digest-MD5}} mechanism (with QOP auth-conf) and supports three security strengths: * high: 3des or rc4 (128 bits) * medium: des or rc4 (56 bits) * low: rc4 (40 bits) 3des and rc4 are slow, only *tens of MB/s*: http://www.javamex.com/tutorials/cryptography/ciphers.shtml http://www.cs.wustl.edu/~jain/cse567-06/ftp/encryption_perf/ I will give more detailed performance data in the future. This is clearly a bottleneck and will vastly affect end-to-end performance. AES (Advanced Encryption Standard) is recommended as a replacement for DES and is more secure; with AES-NI support, throughput can reach nearly *2 GB/s*, so encryption will no longer be the bottleneck. AES and CryptoCodec work is covered in HADOOP-10150, HADOOP-10603 and HADOOP-10693 (we may need to add a new mode for AES). This JIRA will use AES with AES-NI support as the encryption algorithm for the DataTransferProtocol. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
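As an illustration of the direction the description proposes (not the HDFS-6606 patch itself), here is a minimal sketch of AES in CTR mode using the standard JCE API; the class name {{AesCtrSketch}} is hypothetical. On modern JVMs, these {{Cipher}} calls are dispatched to AES-NI instructions when the CPU supports them, which is where the large throughput gain over 3des/rc4 comes from.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesCtrSketch {
    // Encrypt a buffer with AES/CTR, the kind of stream-friendly AES mode
    // the CryptoCodec work builds on; hardware-accelerated where AES-NI exists.
    public static byte[] encrypt(byte[] key, byte[] iv, byte[] plain) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(plain);
    }

    // CTR decryption is the same keystream XORed back out.
    public static byte[] decrypt(byte[] key, byte[] iv, byte[] enc) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(enc);
    }
}
```

A real DataTransferProtocol integration would wrap the socket streams rather than encrypt whole buffers, but the cipher setup is the same.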
[jira] [Commented] (HDFS-6581) Write to single replica in memory
[ https://issues.apache.org/jira/browse/HDFS-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144432#comment-14144432 ] Hadoop QA commented on HDFS-6581: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670595/HDFS-6581.merge.10.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 32 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-httpfs: org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8159//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8159//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8159//console This message is automatically generated. 
Write to single replica in memory - Key: HDFS-6581 URL: https://issues.apache.org/jira/browse/HDFS-6581 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: HDFS-6581.merge.01.patch, HDFS-6581.merge.02.patch, HDFS-6581.merge.03.patch, HDFS-6581.merge.04.patch, HDFS-6581.merge.05.patch, HDFS-6581.merge.06.patch, HDFS-6581.merge.07.patch, HDFS-6581.merge.08.patch, HDFS-6581.merge.09.patch, HDFS-6581.merge.10.patch, HDFSWriteableReplicasInMemory.pdf, Test-Plan-for-HDFS-6581-Memory-Storage.pdf Per discussion with the community on HDFS-5851, we will implement writing to a single replica in DN memory via DataTransferProtocol. This avoids some of the issues with short-circuit writes, which we can revisit at a later time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6606) Optimize HDFS Encrypted Transport performance
[ https://issues.apache.org/jira/browse/HDFS-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144517#comment-14144517 ] Hadoop QA commented on HDFS-6606: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670626/HDFS-6606.006.patch against trunk revision a9a55db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.crypto.random.TestOsSecureRandom org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8161//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8161//console This message is automatically generated. 
[jira] [Commented] (HDFS-7128) Decommission slows way down when it gets towards the end
[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144516#comment-14144516 ] Hadoop QA commented on HDFS-7128: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670602/HDFS-7128.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.server.balancer.TestBalancer org.apache.hadoop.hdfs.server.datanode.fsdataset.TestAvailableSpaceVolumeChoosingPolicy org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8160//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8160//console This message is automatically generated. 
Decommission slows way down when it gets towards the end Key: HDFS-7128 URL: https://issues.apache.org/jira/browse/HDFS-7128 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7128.patch When we decommission nodes across different racks, the decommission process becomes really slow at the end, hardly making any progress. The problem is that some blocks are on 3 decomm-in-progress DNs, and the way replications are scheduled causes unnecessary delay. Here is the analysis. When BlockManager schedules replication work from neededReplication, it first needs to pick the source node for each replication via chooseSourceDatanode. The core policies for picking the source node are: 1. Prefer a decomm-in-progress node. 2. Only pick nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, based on the replication priority. When we decommission nodes: 1. All the decommissioning nodes' blocks are added to neededReplication. 2. BM picks X blocks from neededReplication in each iteration. X is based on cluster size and a configurable multiplier, so if the cluster has 2000 nodes, X will be around 4000. 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen as the source node for all 4000 blocks. The reason the outstanding replication thresholds don't kick in is the implementation of BlockManager.computeReplicationWorkForBlocks: node.getNumberOfBlocksToBeReplicated() remains zero because node.addBlockToBeReplicated is called only after the source-node iteration. {noformat} ... synchronized (neededReplications) { for (int priority = 0; priority < blocksToReplicate.size(); priority++) { ... chooseSourceDatanode ... } for (ReplicationWork rw : work) { ... rw.srcNode.addBlockToBeReplicated(block, targets); ... } {noformat} 4. So several decomm-in-progress nodes A, B, C end up with node.getNumberOfBlocksToBeReplicated() of 4000 each. 5. If we assume each node can replicate 5 blocks per minute, it will take 800 minutes to finish replicating these blocks. 6. The pending replication timeout kicks in after 5 minutes. The items are removed from the pending replication queue and added back to neededReplication, and the replications are then handled by other source nodes of these blocks. But the blocks still remain in nodes A, B, C's replication queues, DatanodeDescriptor.replicateBlocks, so A, B, C continue the replications of these blocks, although these blocks might
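The selection/accounting gap described in steps 2-4 can be modeled with a toy sketch (hypothetical classes, not the actual BlockManager code): because the per-node counter is only incremented after all sources have been chosen, the max-streams threshold check in phase 1 always sees a counter of zero and never rejects the decommissioning node.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two-phase loop in computeReplicationWorkForBlocks.
// Node and schedule() are hypothetical simplifications for illustration.
public class ReplicationSchedulingSketch {
    public static class Node {
        public int blocksToBeReplicated = 0;
    }

    static final int MAX_STREAMS = 2; // stands in for dfs.namenode.replication.max-streams

    public static int schedule(Node decommissioningNode, int blockCount) {
        List<Node> chosenSources = new ArrayList<>();
        // Phase 1: pick a source per block. The counter is still 0 for every
        // check, so the decommissioning node passes the threshold every time.
        for (int i = 0; i < blockCount; i++) {
            if (decommissioningNode.blocksToBeReplicated < MAX_STREAMS) {
                chosenSources.add(decommissioningNode);
            }
        }
        // Phase 2: only now are the per-node counters incremented.
        for (Node src : chosenSources) {
            src.blocksToBeReplicated++;
        }
        return decommissioningNode.blocksToBeReplicated;
    }
}
```

With 4000 blocks, the node ends phase 2 with 4000 queued replications even though the threshold was 2, matching the behavior the analysis describes.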
[jira] [Commented] (HDFS-6606) Optimize HDFS Encrypted Transport performance
[ https://issues.apache.org/jira/browse/HDFS-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144534#comment-14144534 ] Yi Liu commented on HDFS-6606: -- Test failures are unrelated. [~atm] and [~tucu00], do you have further comments? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6633) Support reading new data in a being written file until the file is closed
[ https://issues.apache.org/jira/browse/HDFS-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-6633: Attachment: HDFS-6633-001.patch Attached the patch. Changes: 1. Added 2 new APIs, {{pollNewData()}} and {{isFileUnderConstruction()}}, to DFSInputStream and HdfsDataInputStream. 2. {{pollNewData()}} should be called after hitting EOF on a being-written file. 3. Once it returns true, reading can continue again. I tried changing the data transfer protocol to continue reading from the existing stream itself, but I ran into problems in BlockSender. Support reading new data in a being written file until the file is closed - Key: HDFS-6633 URL: https://issues.apache.org/jira/browse/HDFS-6633 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Vinayakumar B Attachments: HDFS-6633-001.patch, h6633_20140707.patch, h6633_20140708.patch When a file is being written, the file length keeps increasing. If the file is opened for read, the reader first gets the file length and then reads only up to that length. The reader will not be able to read the new data written afterward. We propose adding a new feature so that readers will be able to read all the data until the writer closes the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
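As a sketch of how a reader might use the two proposed APIs, here is a tail-style loop against a stand-in interface; {{TailableStream}} is a hypothetical simplification of HdfsDataInputStream, since the patch's exact signatures may differ, and a real reader would wait and re-poll rather than give up when no new data has arrived yet.

```java
import java.util.ArrayList;
import java.util.List;

public class TailReaderSketch {
    // Hypothetical stand-in for the APIs proposed for HdfsDataInputStream.
    public interface TailableStream {
        int read();                        // next byte, or -1 at the current EOF
        boolean pollNewData();             // after EOF: did the writer append more data?
        boolean isFileUnderConstruction(); // is the writer still holding the file open?
    }

    // Drain to EOF, then keep reading as long as the file is still under
    // construction and pollNewData() reports fresh bytes.
    public static List<Integer> tail(TailableStream in) {
        List<Integer> out = new ArrayList<>();
        while (true) {
            int b = in.read();
            if (b >= 0) {
                out.add(b);
                continue;
            }
            if (!in.isFileUnderConstruction()) {
                return out; // writer closed the file: this EOF is final
            }
            if (!in.pollNewData()) {
                return out; // sketch only: a real reader would wait and re-poll
            }
        }
    }
}
```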
[jira] [Created] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes
gurmukh singh created HDFS-7134: --- Summary: Replication count for a block should not update till the blocks have settled on Datanodes Key: HDFS-7134 URL: https://issues.apache.org/jira/browse/HDFS-7134 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.1 Environment: Linux nn1.cluster1.com 2.6.32-431.20.3.el6.x86_64 #1 SMP Thu Jun 19 21:14:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux [hadoop@nn1 conf]$ cat /etc/redhat-release CentOS release 6.5 (Final) Reporter: gurmukh singh The count of replicas for a block should not change till the blocks have settled on the datanodes. Test case: Hadoop cluster with 1 namenode and 3 datanodes. nn1.cluster1.com(192.168.1.70) dn1.cluster1.com(192.168.1.72) dn2.cluster1.com(192.168.1.73) dn3.cluster1.com(192.168.1.74) Cluster is up and running fine with replication set to 1 via the dfs.replication parameter on all nodes: <property> <name>dfs.replication</name> <value>1</value> </property> To reduce the wait time, the dfs.heartbeat and recheck parameters have been reduced. On datanode2 (192.168.1.72): [hadoop@dn2 ~]$ hadoop fs -Ddfs.replication=2 -put from_dn2 / [hadoop@dn2 ~]$ hadoop fs -ls /from_dn2 Found 1 items -rw-r--r-- 2 hadoop supergroup 17 2014-09-23 13:33 /from_dn2 On Namenode === As expected, the copy was done from datanode2; one copy goes locally. [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:53:16 IST 2014 /from_dn2 17 bytes, 1 block(s): OK 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 192.168.1.73:50010] The blocks can also be seen on the datanodes' disks under the current directory. Now, shut down datanode2 (192.168.1.73), and as expected the block moves to another datanode to maintain a replication of 2: [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:54:21 IST 2014 /from_dn2 17 bytes, 1 block(s): OK 0.
blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 192.168.1.72:50010] But now, if I bring back datanode2, although the namenode sees that this block is in 3 places and fires an invalidate command for datanode1 (192.168.1.72), the replication count on the namenode is bumped to 3 immediately. [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:56:12 IST 2014 /from_dn2 17 bytes, 1 block(s): OK 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 192.168.1.72:50010, 192.168.1.73:50010] On datanode1, the invalidate command was fired immediately and the block deleted. = 2014-09-23 13:54:17,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: /192.168.1.72:50010 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: /192.168.1.72:50010 size 17 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling blk_8132629811771280764_1175 file /space/disk1/current/blk_8132629811771280764 for deletion 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted blk_8132629811771280764_1175 at file /space/disk1/current/blk_8132629811771280764 The namenode still shows 3 replicas even though one has been deleted, even after more than 30 minutes. [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 14:21:27 IST 2014 /from_dn2 17 bytes, 1 block(s): OK 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 192.168.1.72:50010, 192.168.1.73:50010] This could be dangerous if someone removes a replica or the other 2 datanodes fail.
On Datanode 1 = Before datanode1 is brought back: [hadoop@dn1 conf]$ ls -l /space/disk*/current /space/disk1/current: total 28 -rw-rw-r-- 1 hadoop hadoop 13 Sep 21 09:09 blk_2278001646987517832 -rw-rw-r-- 1 hadoop hadoop 11 Sep 21 09:09 blk_2278001646987517832_1171.meta -rw-rw-r-- 1 hadoop hadoop 17 Sep 23 13:54 blk_8132629811771280764 -rw-rw-r-- 1 hadoop hadoop 11 Sep 23 13:54 blk_8132629811771280764_1175.meta -rw-rw-r-- 1 hadoop hadoop 5299 Sep 21 10:04 dncp_block_verification.log.curr -rw-rw-r-- 1 hadoop hadoop 157 Sep 23 13:51 VERSION After starting the datanode daemon: [hadoop@dn1 conf]$ ls -l /space/disk*/current /space/disk1/current: total 20 -rw-rw-r-- 1 hadoop hadoop 13 Sep 21 09:09 blk_2278001646987517832 -rw-rw-r-- 1 hadoop hadoop 11 Sep 21 09:09 blk_2278001646987517832_1171.meta -rw-rw-r-- 1 hadoop hadoop 5299 Sep 21 10:04 dncp_block_verification.log.curr -rw-rw-r-- 1 hadoop
[jira] [Commented] (HDFS-7097) Allow block reports to be processed during checkpointing on standby name node
[ https://issues.apache.org/jira/browse/HDFS-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144584#comment-14144584 ] Vinayakumar B commented on HDFS-7097: - Thanks Kihwal, the changes look good. I have the same question as [~mingma]. 1. Though it's not frequent, a saveNamespace RPC on the standby will create the same problem as mentioned in the description. Can we have different locks for saveNamespace based on HA state? Allow block reports to be processed during checkpointing on standby name node - Key: HDFS-7097 URL: https://issues.apache.org/jira/browse/HDFS-7097 Project: Hadoop HDFS Issue Type: Bug Reporter: Kihwal Lee Assignee: Kihwal Lee Priority: Critical Attachments: HDFS-7097.patch On a reasonably busy HDFS cluster, there is a stream of creates, causing data nodes to generate incremental block reports. When a standby name node is checkpointing, RPC handler threads trying to process a full or incremental block report are blocked on the name system's {{fsLock}}, because the checkpointer acquires the read lock on it. This can create a serious problem if the name space is big and checkpointing takes a long time. All available RPC handlers can be tied up very quickly. If you have 100 handlers, it only takes 34 file creates. If a separate service RPC port is not used, HA transition will have to wait in the call queue for minutes. Even if a separate service RPC port is configured, heartbeats from datanodes will be blocked. A standby NN with a big name space can lose all data nodes after checkpointing. The RPC calls will also be retransmitted by data nodes many times, filling up the call queue and potentially causing listen queue overflow. Since block reports do not modify any state that is being saved to the fsimage, I propose letting them through during checkpointing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
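The contention described above can be demonstrated with a minimal sketch, using {{ReentrantReadWriteLock}} as a stand-in for the namesystem's {{fsLock}} (FSNamesystem's actual locking is more involved): while the checkpointer holds the read lock, a handler needing the write lock cannot get it.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FsLockSketch {
    // Returns whether a block-report handler could take the write lock while
    // the checkpointer holds the read lock. ReentrantReadWriteLock does not
    // allow upgrading read -> write, so tryLock() fails here.
    public static boolean handlerCanProceedDuringCheckpoint() {
        ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
        fsLock.readLock().lock();                 // checkpointer saving the namespace
        try {
            return fsLock.writeLock().tryLock();  // block-report RPC handler
        } finally {
            fsLock.readLock().unlock();
        }
    }
}
```

The proposed fix sidesteps this by letting block reports bypass the lock entirely during checkpointing, since they do not touch state being written to the fsimage.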
[jira] [Commented] (HDFS-7102) Null dereference in PacketReceiver#receiveNextPacket()
[ https://issues.apache.org/jira/browse/HDFS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144593#comment-14144593 ] Vinayakumar B commented on HDFS-7102: - Are you seeing any potential problem here? Null dereference in PacketReceiver#receiveNextPacket() -- Key: HDFS-7102 URL: https://issues.apache.org/jira/browse/HDFS-7102 Project: Hadoop HDFS Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} public void receiveNextPacket(ReadableByteChannel in) throws IOException { doRead(in, null); {code} doRead() passes null as the second parameter to (line 134): {code} doReadFully(ch, in, curPacketBuf); {code} which dereferences it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
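For context on why the null second argument may be harmless (the question the comment raises), here is a sketch of the either-or read pattern such code commonly uses; the class and method below are hypothetical, not the actual PacketReceiver source. Exactly one of the two inputs is non-null, and only the non-null one is dereferenced, which static analyzers often flag anyway.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class EitherOrReadSketch {
    // Fill buf completely from whichever input is non-null. Passing null for
    // one of ch/in is by design, not a null-dereference bug.
    public static void readFully(ReadableByteChannel ch, InputStream in, ByteBuffer buf)
            throws IOException {
        if (ch != null) {
            while (buf.remaining() > 0) {
                if (ch.read(buf) < 0) throw new IOException("premature EOF");
            }
        } else {
            // ch is null: read from the stream instead, never touching ch.
            while (buf.remaining() > 0) {
                int n = in.read(buf.array(), buf.position(), buf.remaining());
                if (n < 0) throw new IOException("premature EOF");
                buf.position(buf.position() + n);
            }
        }
    }
}
```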
[jira] [Updated] (HDFS-7113) Add DFSAdmin Command to Recover Lease
[ https://issues.apache.org/jira/browse/HDFS-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-7113: Fix Version/s: (was: 2.5.1) Add DFSAdmin Command to Recover Lease - Key: HDFS-7113 URL: https://issues.apache.org/jira/browse/HDFS-7113 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.5.0 Reporter: Miklos Christine Priority: Minor Attachments: HDFS-7113.2.patch, HDFS-7113.patch In certain conditions, a lease may be left around if an error occurs while writing to HDFS and the file is left open. Having a DFSAdmin command would allow administrators to recover the lease and close the file easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7115) TestEncryptionZones assumes Unix path separator for KMS key store path
[ https://issues.apache.org/jira/browse/HDFS-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144663#comment-14144663 ] Hudson commented on HDFS-7115: -- FAILURE: Integrated in Hadoop-Yarn-trunk #689 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/689/]) HDFS-7115. TestEncryptionZones assumes Unix path separator for KMS key store path. Contributed by Xiaoyu Yao. (cnauroth: rev 26cba7f35ff24262afa5d8f9ed22f3a7f01d9a71) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestEncryptionZones.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt TestEncryptionZones assumes Unix path separator for KMS key store path -- Key: HDFS-7115 URL: https://issues.apache.org/jira/browse/HDFS-7115 Project: Hadoop HDFS Issue Type: Test Components: encryption Affects Versions: 2.5.1 Reporter: Xiaoyu Yao Assignee: Xiaoyu Yao Fix For: 2.6.0 Attachments: HDFS-7115.0.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7001) Tests in TestTracing should not depend on the order of execution
[ https://issues.apache.org/jira/browse/HDFS-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144660#comment-14144660 ] Hudson commented on HDFS-7001: -- FAILURE: Integrated in Hadoop-Yarn-trunk #689 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/689/]) HDFS-7001. Tests in TestTracing should not depend on the order of execution. (iwasakims via cmccabe) (cmccabe: rev 7b8df93ce1b7204a247e64b394d57eef748e73aa) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tracing/TestTracing.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt Tests in TestTracing should not depend on the order of execution Key: HDFS-7001 URL: https://issues.apache.org/jira/browse/HDFS-7001 Project: Hadoop HDFS Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7001-0.patch, HDFS-7001-1.patch o.a.h.tracing.TestTracing#testSpanReceiverHost is assumed to be executed first. It should be done in BeforeClass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7106) Reconfiguring DataNode volumes does not release the lock files in removed volumes.
[ https://issues.apache.org/jira/browse/HDFS-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144669#comment-14144669 ] Hudson commented on HDFS-7106: -- FAILURE: Integrated in Hadoop-Yarn-trunk #689 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/689/]) HDFS-7106. Reconfiguring DataNode volumes does not release the lock files in removed volumes. (cnauroth via cmccabe) (cmccabe: rev 912ad32b03c1e023ab88918bfa8cb356d1851545) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeHotSwapVolumes.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java Reconfiguring DataNode volumes does not release the lock files in removed volumes. -- Key: HDFS-7106 URL: https://issues.apache.org/jira/browse/HDFS-7106 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 2.6.0 Attachments: HDFS-7106.1.patch, HDFS-7106.2.patch, HDFS-7106.3.patch After reconfiguring a DataNode to remove volumes without restarting the DataNode, the process still holds lock files exclusively in all of the volumes that were removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6633) Support reading new data in a being written file until the file is closed
[ https://issues.apache.org/jira/browse/HDFS-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144692#comment-14144692 ] Hadoop QA commented on HDFS-6633: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670663/HDFS-6633-001.patch against trunk revision f557820. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.server.balancer.TestBalancer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8162//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8162//console This message is automatically generated. 
[jira] [Commented] (HDFS-7115) TestEncryptionZones assumes Unix path separator for KMS key store path
[ https://issues.apache.org/jira/browse/HDFS-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144818#comment-14144818 ] Hudson commented on HDFS-7115: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1880 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1880/]) HDFS-7115. TestEncryptionZones assumes Unix path separator for KMS key store path. Contributed by Xiaoyu Yao. (cnauroth: rev 26cba7f35ff24262afa5d8f9ed22f3a7f01d9a71) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestEncryptionZones.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7115) TestEncryptionZones assumes Unix path separator for KMS key store path
[ https://issues.apache.org/jira/browse/HDFS-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144795#comment-14144795 ] Hudson commented on HDFS-7115: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1905 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1905/]) HDFS-7115. TestEncryptionZones assumes Unix path separator for KMS key store path. Contributed by Xiaoyu Yao. (cnauroth: rev 26cba7f35ff24262afa5d8f9ed22f3a7f01d9a71) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestEncryptionZones.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7106) Reconfiguring DataNode volumes does not release the lock files in removed volumes.
[ https://issues.apache.org/jira/browse/HDFS-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144824#comment-14144824 ] Hudson commented on HDFS-7106: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1880 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1880/]) HDFS-7106. Reconfiguring DataNode volumes does not release the lock files in removed volumes. (cnauroth via cmccabe) (cmccabe: rev 912ad32b03c1e023ab88918bfa8cb356d1851545) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeHotSwapVolumes.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt Reconfiguring DataNode volumes does not release the lock files in removed volumes. -- Key: HDFS-7106 URL: https://issues.apache.org/jira/browse/HDFS-7106 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 2.6.0 Attachments: HDFS-7106.1.patch, HDFS-7106.2.patch, HDFS-7106.3.patch After reconfiguring a DataNode to remove volumes without restarting the DataNode, the process still holds lock files exclusively in all of the volumes that were removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7106) Reconfiguring DataNode volumes does not release the lock files in removed volumes.
[ https://issues.apache.org/jira/browse/HDFS-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144801#comment-14144801 ] Hudson commented on HDFS-7106: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1905 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1905/]) HDFS-7106. Reconfiguring DataNode volumes does not release the lock files in removed volumes. (cnauroth via cmccabe) (cmccabe: rev 912ad32b03c1e023ab88918bfa8cb356d1851545) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeHotSwapVolumes.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java Reconfiguring DataNode volumes does not release the lock files in removed volumes. -- Key: HDFS-7106 URL: https://issues.apache.org/jira/browse/HDFS-7106 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 2.6.0 Attachments: HDFS-7106.1.patch, HDFS-7106.2.patch, HDFS-7106.3.patch After reconfiguring a DataNode to remove volumes without restarting the DataNode, the process still holds lock files exclusively in all of the volumes that were removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144833#comment-14144833 ] Konstantin Shvachko commented on HDFS-3107: --- Plamen, could you please update the JavaDocs per the design doc? I found one thing: * Fails if truncate is not to a block boundary and someone has a lease. This should be removed, because the block boundary is irrelevant and you already said earlier that truncate fails if the file is not closed. HDFS truncate - Key: HDFS-3107 URL: https://issues.apache.org/jira/browse/HDFS-3107 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Lei Chang Assignee: Plamen Jeliazkov Attachments: HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf Original Estimate: 1,344h Remaining Estimate: 1,344h Systems with transaction support often need to undo changes made to the underlying storage when a transaction is aborted. Currently HDFS does not support truncate (a standard POSIX operation), the reverse operation of append, which forces upper-layer applications to use ugly workarounds (such as keeping track of the discarded byte range per file in a separate metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7001) Tests in TestTracing should not depend on the order of execution
[ https://issues.apache.org/jira/browse/HDFS-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144792#comment-14144792 ] Hudson commented on HDFS-7001: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1905 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1905/]) HDFS-7001. Tests in TestTracing should not depend on the order of execution. (iwasakims via cmccabe) (cmccabe: rev 7b8df93ce1b7204a247e64b394d57eef748e73aa) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tracing/TestTracing.java Tests in TestTracing should not depend on the order of execution Key: HDFS-7001 URL: https://issues.apache.org/jira/browse/HDFS-7001 Project: Hadoop HDFS Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7001-0.patch, HDFS-7001-1.patch o.a.h.tracing.TestTracing#testSpanReceiverHost is assumed to be executed first. It should be done in BeforeClass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-3107: -- Attachment: HDFS_truncate.pdf Here is the design doc. HDFS truncate - Key: HDFS-3107 URL: https://issues.apache.org/jira/browse/HDFS-3107 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Lei Chang Assignee: Plamen Jeliazkov Attachments: HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf Original Estimate: 1,344h Remaining Estimate: 1,344h Systems with transaction support often need to undo changes made to the underlying storage when a transaction is aborted. Currently HDFS does not support truncate (a standard POSIX operation), the reverse operation of append, which forces upper-layer applications to use ugly workarounds (such as keeping track of the discarded byte range per file in a separate metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7001) Tests in TestTracing should not depend on the order of execution
[ https://issues.apache.org/jira/browse/HDFS-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144815#comment-14144815 ] Hudson commented on HDFS-7001: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1880 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1880/]) HDFS-7001. Tests in TestTracing should not depend on the order of execution. (iwasakims via cmccabe) (cmccabe: rev 7b8df93ce1b7204a247e64b394d57eef748e73aa) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tracing/TestTracing.java Tests in TestTracing should not depend on the order of execution Key: HDFS-7001 URL: https://issues.apache.org/jira/browse/HDFS-7001 Project: Hadoop HDFS Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7001-0.patch, HDFS-7001-1.patch o.a.h.tracing.TestTracing#testSpanReceiverHost is assumed to be executed first. It should be done in BeforeClass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7128) Decommission slows way down when it gets towards the end
[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144865#comment-14144865 ] Kihwal Lee edited comment on HDFS-7128 at 9/23/14 3:00 PM: --- This is not just about decommissioning. If nodes die and a large number of blocks need to be replicated, the replication monitor can schedule a large number of blocks in one run, and it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}} is not updated. As you pointed out, gross over-scheduling should be avoided, as it causes replication timeouts and potentially duplicate replications and invalidations. In my experience, multiple node deaths are commonly caused by DNS or network outages. When such an outage causes a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost every node is over-scheduled with replication work that is no longer necessary. This patch will also help in that case. I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}} in the finally block of the try block surrounding {{chooseTarget()}}. Do you think the default soft limit and hard limit are reasonable? was (Author: kihwal): This is not just about decommissioning. If nodes die and a large number of blocks need to be replicated, the replication monitor can schedule a large number of blocks in one run, and it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}} is not updated. As you pointed out, gross over-scheduling should be avoided, as it causes replication timeouts and potentially duplicate replications and invalidations. In my experience, multiple node deaths are commonly caused by DNS or network outages. 
When such an outage causes a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost every node is over-scheduled with replication work that is no longer necessary. This patch will also help in that case. I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}} in a finally block surrounding {{chooseTarget()}}. Do you think the default soft limit and hard limit are reasonable? Decommission slows way down when it gets towards the end Key: HDFS-7128 URL: https://issues.apache.org/jira/browse/HDFS-7128 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7128.patch When we decommission nodes across different racks, the decommission process becomes really slow at the end, hardly making any progress. The problem is that some blocks are on 3 decomm-in-progress DNs, and the way replications are scheduled causes unnecessary delay. Here is the analysis. When BlockManager schedules replication work from neededReplication, it first needs to pick the source node for replication via chooseSourceDatanode. The core policies for picking the source node are: 1. Prefer a decomm-in-progress node. 2. Only pick nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, based on the replication priority. When we decommission nodes: 1. All the decommissioned nodes' blocks will be added to neededReplication. 2. BM will pick X blocks from neededReplication in each iteration. X is based on cluster size and a configurable multiplier. So if the cluster has 2000 nodes, X will be around 4000. 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen as the source node for all these 4000 blocks. 
The reason the outstanding replication thresholds don't kick in lies in the implementation of BlockManager.computeReplicationWorkForBlocks; node.getNumberOfBlocksToBeReplicated() remains zero because node.addBlockToBeReplicated is called after the source node iteration. {noformat} ... synchronized (neededReplications) { for (int priority = 0; priority < blocksToReplicate.size(); priority++) { ... chooseSourceDatanode ... } for (ReplicationWork rw : work) { ... rw.srcNode.addBlockToBeReplicated(block, targets); ... } {noformat} 4. So several decomm-in-progress nodes A, B, C end up with node.getNumberOfBlocksToBeReplicated() counts of 4000. 5. If we assume each node can replicate 5 blocks per minute, it is going to take 800 minutes to finish replicating these blocks. 6. The pending replication timeout kicks in after 5 minutes. The items will be removed from the pending replication queue and added back to neededReplication.
[jira] [Commented] (HDFS-7114) Secondary NameNode failed to rollback from 2.4.1 to 2.2.0
[ https://issues.apache.org/jira/browse/HDFS-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144882#comment-14144882 ] Kihwal Lee commented on HDFS-7114: -- The secondary NameNode does not persist any state needed for its own startup. Clean up its temporary storage and restart. Secondary NameNode failed to rollback from 2.4.1 to 2.2.0 - Key: HDFS-7114 URL: https://issues.apache.org/jira/browse/HDFS-7114 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0 Reporter: sam liu Priority: Blocker Upgrading from 2.2.0 to 2.4.1 works, but rolling back the secondary NameNode fails with the following issue. 2014-09-22 10:41:28,358 FATAL org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Failed to start secondary namenode org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /var/hadoop/tmp/hdfs/dfs/namesecondary. Reported: -56. Expecting = -47. at org.apache.hadoop.hdfs.server.common.Storage.setLayoutVersion(Storage.java:1082) at org.apache.hadoop.hdfs.server.common.Storage.setFieldsFromProperties(Storage.java:890) at org.apache.hadoop.hdfs.server.namenode.NNStorage.setFieldsFromProperties(NNStorage.java:585) at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:921) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:913) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:249) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:199) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:652) 2014-09-22 10:41:28,360 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2014-09-22 10:41:28,363 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7128) Decommission slows way down when it gets towards the end
[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144865#comment-14144865 ] Kihwal Lee edited comment on HDFS-7128 at 9/23/14 2:59 PM: --- This is not just about decommissioning. If nodes die and a large number of blocks need to be replicated, the replication monitor can schedule a large number of blocks in one run, and it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}} is not updated. As you pointed out, gross over-scheduling should be avoided, as it causes replication timeouts and potentially duplicate replications and invalidations. In my experience, multiple node deaths are commonly caused by DNS or network outages. When such an outage causes a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost every node is over-scheduled with replication work that is no longer necessary. This patch will also help in that case. I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}} in a finally block surrounding {{chooseTarget()}}. Do you think the default soft limit and hard limit are reasonable? was (Author: kihwal): This is not just about decommissioning. If nodes die and a large number of blocks need to be replicated, the replication monitor can schedule a large number of blocks in one run, and it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}} is not updated. As you pointed out, gross over-scheduling should be avoided, as it causes replication timeouts and potentially duplicate replication work and invalidation. In my experience, multiple node deaths are commonly caused by DNS or network outages. 
When such an outage causes a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost every node is over-scheduled with replication work that is no longer necessary. This patch will also help in this case. I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}} in a finally block surrounding {{chooseTarget()}}. Do you think the default soft limit and hard limit are reasonable? Decommission slows way down when it gets towards the end Key: HDFS-7128 URL: https://issues.apache.org/jira/browse/HDFS-7128 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7128.patch When we decommission nodes across different racks, the decommission process becomes really slow at the end, hardly making any progress. The problem is that some blocks are on 3 decomm-in-progress DNs, and the way replications are scheduled causes unnecessary delay. Here is the analysis. When BlockManager schedules replication work from neededReplication, it first needs to pick the source node for replication via chooseSourceDatanode. The core policies for picking the source node are: 1. Prefer a decomm-in-progress node. 2. Only pick nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, based on the replication priority. When we decommission nodes: 1. All the decommissioned nodes' blocks will be added to neededReplication. 2. BM will pick X blocks from neededReplication in each iteration. X is based on cluster size and a configurable multiplier. So if the cluster has 2000 nodes, X will be around 4000. 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen as the source node for all these 4000 blocks. 
The reason the outstanding replication thresholds don't kick in lies in the implementation of BlockManager.computeReplicationWorkForBlocks; node.getNumberOfBlocksToBeReplicated() remains zero because node.addBlockToBeReplicated is called after the source node iteration. {noformat} ... synchronized (neededReplications) { for (int priority = 0; priority < blocksToReplicate.size(); priority++) { ... chooseSourceDatanode ... } for (ReplicationWork rw : work) { ... rw.srcNode.addBlockToBeReplicated(block, targets); ... } {noformat} 4. So several decomm-in-progress nodes A, B, C end up with node.getNumberOfBlocksToBeReplicated() counts of 4000. 5. If we assume each node can replicate 5 blocks per minute, it is going to take 800 minutes to finish replicating these blocks. 6. The pending replication timeout kicks in after 5 minutes. The items will be removed from the pending replication queue and added back to neededReplication.
[jira] [Commented] (HDFS-7128) Decommission slows way down when it gets towards the end
[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144865#comment-14144865 ] Kihwal Lee commented on HDFS-7128: -- This is not just about decommissioning. If nodes die and a large number of blocks need to be replicated, the replication monitor can schedule a large number of blocks in one run, and it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}} is not updated. As you pointed out, gross over-scheduling should be avoided, as it causes replication timeouts and potentially duplicate replication work and invalidation. In my experience, multiple node deaths are commonly caused by DNS or network outages. When such an outage causes a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost every node is over-scheduled with replication work that is no longer necessary. This patch will also help in this case. I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}} in a finally block surrounding {{chooseTarget()}}. Do you think the default soft limit and hard limit are reasonable? Decommission slows way down when it gets towards the end Key: HDFS-7128 URL: https://issues.apache.org/jira/browse/HDFS-7128 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7128.patch When we decommission nodes across different racks, the decommission process becomes really slow at the end, hardly making any progress. The problem is that some blocks are on 3 decomm-in-progress DNs, and the way replications are scheduled causes unnecessary delay. Here is the analysis. When BlockManager schedules replication work from neededReplication, it first needs to pick the source node for replication via chooseSourceDatanode. The core policies for picking the source node are: 1. Prefer a decomm-in-progress node. 
2. Only pick nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, based on the replication priority. When we decommission nodes: 1. All the decommissioned nodes' blocks will be added to neededReplication. 2. BM will pick X blocks from neededReplication in each iteration. X is based on cluster size and a configurable multiplier. So if the cluster has 2000 nodes, X will be around 4000. 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen as the source node for all these 4000 blocks. The reason the outstanding replication thresholds don't kick in lies in the implementation of BlockManager.computeReplicationWorkForBlocks; node.getNumberOfBlocksToBeReplicated() remains zero because node.addBlockToBeReplicated is called after the source node iteration. {noformat} ... synchronized (neededReplications) { for (int priority = 0; priority < blocksToReplicate.size(); priority++) { ... chooseSourceDatanode ... } for (ReplicationWork rw : work) { ... rw.srcNode.addBlockToBeReplicated(block, targets); ... } {noformat} 4. So several decomm-in-progress nodes A, B, C end up with node.getNumberOfBlocksToBeReplicated() counts of 4000. 5. If we assume each node can replicate 5 blocks per minute, it is going to take 800 minutes to finish replicating these blocks. 6. The pending replication timeout kicks in after 5 minutes. The items will be removed from the pending replication queue and added back to neededReplication. The replications will then be handled by other source nodes for these blocks. But the blocks still remain in nodes A, B, and C's pending replication queues, DatanodeDescriptor.replicateBlocks, so A, B, and C continue the replications of these blocks, although the blocks might already have been replicated by other DNs after the replication timeout. 7. Some blocks' replicas exist on A, B, and C, and they sit at the end of A's pending replication queue. 
Even though such a block's replication times out, no source node can be chosen, given that A, B, and C all have high pending replication counts. So we have to wait until A drains its pending replication queue. Meanwhile, the items in A's pending replication queue have been taken care of by other nodes and are no longer under-replicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
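The scheduling fix discussed in the comments above can be sketched as a toy model: count replication work against the source node as soon as it is scheduled, and undo the count in a finally block when target selection fails, so the max-streams check sees accurate numbers within a single scheduling run. The names here mirror the JIRA comment ({{DatanodeDescriptor}}, {{chooseTarget()}}, the max-streams limit), but this is a simplified illustration under those assumptions, not the actual BlockManager code.

```java
// Toy model of the over-scheduling fix; not actual Hadoop code.
public class ReplicationSchedulingSketch {
    static final int MAX_STREAMS = 2; // stand-in for dfs.namenode.replication.max-streams

    static class DatanodeDescriptor {
        private int pending = 0;
        int getNumberOfBlocksToBeReplicated() { return pending; }
        void incrementPendingReplicationWithoutTargets() { pending++; }
        void decrementPendingReplicationWithoutTargets() { pending--; }
    }

    /** Pretend target chooser; returns null when no target can be found. */
    static String chooseTarget(boolean targetsAvailable) {
        return targetsAvailable ? "someTarget" : null;
    }

    /** Returns true if a replication was scheduled on the source node. */
    static boolean scheduleReplication(DatanodeDescriptor src, boolean targetsAvailable) {
        // Respect the outstanding-streams threshold before taking on more work.
        if (src.getNumberOfBlocksToBeReplicated() >= MAX_STREAMS) {
            return false; // over the limit; a later run must pick another source
        }
        // Count the work immediately, so later blocks in the same scheduling
        // run see an up-to-date number (the reported bug: this stayed zero).
        src.incrementPendingReplicationWithoutTargets();
        String target = null;
        try {
            target = chooseTarget(targetsAvailable);
        } finally {
            // Undo the increment when chooseTarget() produced nothing, so
            // failed attempts do not inflate the node's pending count.
            if (target == null) {
                src.decrementPendingReplicationWithoutTargets();
            }
        }
        return target != null;
    }

    public static void main(String[] args) {
        DatanodeDescriptor a = new DatanodeDescriptor();
        int scheduled = 0;
        for (int i = 0; i < 10; i++) {
            if (scheduleReplication(a, true)) scheduled++;
        }
        // Only MAX_STREAMS blocks land on node A in one pass, not all 10.
        System.out.println(scheduled); // prints 2
    }
}
```

With the counter updated inside the scheduling loop, a single decomm-in-progress node can no longer absorb thousands of blocks in one iteration of the replication monitor.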
[jira] [Updated] (HDFS-7126) TestEncryptionZonesWithHA assumes Unix path separator for KMS key store path
[ https://issues.apache.org/jira/browse/HDFS-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7126: Resolution: Fixed Fix Version/s: 2.6.0 Status: Resolved (was: Patch Available) The test failures are unrelated. I committed this to trunk and branch-2. Xiaoyu, thank you for your help cleaning up these tests. TestEncryptionZonesWithHA assumes Unix path separator for KMS key store path Key: HDFS-7126 URL: https://issues.apache.org/jira/browse/HDFS-7126 Project: Hadoop HDFS Issue Type: Test Components: security, test Reporter: Xiaoyu Yao Assignee: Xiaoyu Yao Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7126.0.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7130) TestDataTransferKeepalive fails intermittently on Windows.
[ https://issues.apache.org/jira/browse/HDFS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144900#comment-14144900 ] Chris Nauroth commented on HDFS-7130: - It looks like the pre-commit job's process got killed somehow. I've submitted a new pre-commit run. TestDataTransferKeepalive fails intermittently on Windows. -- Key: HDFS-7130 URL: https://issues.apache.org/jira/browse/HDFS-7130 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7130.1.patch {{TestDataTransferKeepalive}} has failed intermittently on Windows. These tests rely on a 1 ms thread sleep to wait for a cache expiration. This is likely too short on Windows, which has been observed to have a less granular clock interrupt period compared to typical Linux machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
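The pattern behind a fix like HDFS-7130 can be sketched as follows: instead of a single short sleep followed by a hard assertion, poll the condition until a generous deadline, which tolerates a coarse clock interrupt period. The {{waitFor}} helper below is illustrative, not the actual patch (Hadoop's test utilities provide a similar helper in GenericTestUtils).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Sketch of a deadline-based polling wait for timing-sensitive tests.
public class PollingWaitSketch {
    /** Polls cond every intervalMs until it is true or timeoutMs elapses. */
    static boolean waitFor(BooleanSupplier cond, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (cond.getAsBoolean()) {
                return true;
            }
            // Sleep granularity no longer matters: an oversleep only delays
            // the next poll, it cannot make the test fail spuriously.
            Thread.sleep(intervalMs);
        }
        return cond.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // Condition becomes true ~50 ms after start; the wait tolerates a
        // clock interrupt period far coarser than 1 ms.
        boolean ok = waitFor(
            () -> System.nanoTime() - start > TimeUnit.MILLISECONDS.toNanos(50),
            10, 5000);
        System.out.println(ok); // prints true
    }
}
```

The key design choice is asserting on an eventual state with a large timeout rather than on a precise delay, which keeps the test fast on Linux and correct on Windows.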
[jira] [Created] (HDFS-7135) Add trace command to FsShell
Masatake Iwasaki created HDFS-7135: -- Summary: Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-7135: --- Component/s: hdfs-client Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-7135: --- Component/s: (was: datanode) (was: namenode) Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-7135: --- Attachment: HDFS-7135-0.patch attaching patch. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-7135: --- Attachment: HDFS-7135-1.patch I updated the patch. Trace does not need to use ToolRunner because generic option parsing is done before invoking Trace#run. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7114) Secondary NameNode failed to rollback from 2.4.1 to 2.2.0
[ https://issues.apache.org/jira/browse/HDFS-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee resolved HDFS-7114. -- Resolution: Invalid Secondary NameNode failed to rollback from 2.4.1 to 2.2.0 - Key: HDFS-7114 URL: https://issues.apache.org/jira/browse/HDFS-7114 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0 Reporter: sam liu Priority: Blocker Upgrading from 2.2.0 to 2.4.1 works, but rolling back the secondary NameNode fails with the following issue. 2014-09-22 10:41:28,358 FATAL org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Failed to start secondary namenode org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /var/hadoop/tmp/hdfs/dfs/namesecondary. Reported: -56. Expecting = -47. at org.apache.hadoop.hdfs.server.common.Storage.setLayoutVersion(Storage.java:1082) at org.apache.hadoop.hdfs.server.common.Storage.setFieldsFromProperties(Storage.java:890) at org.apache.hadoop.hdfs.server.namenode.NNStorage.setFieldsFromProperties(NNStorage.java:585) at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:921) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:913) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:249) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:199) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:652) 2014-09-22 10:41:28,360 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2014-09-22 10:41:28,363 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-7135: --- Status: Patch Available (was: Open) Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6799) The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system.
[ https://issues.apache.org/jira/browse/HDFS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144968#comment-14144968 ] Tsz Wo Nicholas Sze commented on HDFS-6799: --- TestBalancer.testUnknownDatanode failed in [build #8162|https://builds.apache.org/job/PreCommit-HDFS-Build/8162//testReport/org.apache.hadoop.hdfs.server.balancer/TestBalancer/testUnknownDatanode/]. It might be related to the change here, since there were a lot of ReplicaNotFoundExceptions resulting from SimulatedFSDataset. For example, {noformat} 2014-09-23 11:20:05,974 ERROR datanode.DataNode (DataXceiver.java:run(243)) - host1.foo.com:47137:DataXceiver error processing COPY_BLOCK operation src: /127.0.0.1:36294 dst: /127.0.0.1:47137 org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1049218722-67.195.81.148-1411471192218:blk_1073741850_1026 at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419) at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:918) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:241) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225) at java.lang.Thread.run(Thread.java:662) {noformat} The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system. --- Key: HDFS-6799 URL: https://issues.apache.org/jira/browse/HDFS-6799 Project: Hadoop HDFS Issue Type: Bug Components: datanode, test Affects Versions: 2.4.1 Reporter: Megasthenis Asteris Assignee: Megasthenis Asteris Priority: Minor Fix For: 2.6.0 Attachments: HDFS-6799.patch The invalidate(String bpid, Block[] invalidBlks) method in SimulatedFSDataset.java should remove all invalidBlks from the simulated file system. 
It currently fails to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
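The reported bug can be illustrated with a simplified, hypothetical model (class and method names below are illustrative, not taken from SimulatedFSDataset itself): invalidate must actually remove each listed block from the per-block-pool map, otherwise later readers find replicas that should be gone.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the fix: invalidate removes each listed block
// from the simulated per-block-pool map instead of leaving it behind.
public class SimulatedDatasetSketch {
    private final Map<String, Map<Long, byte[]>> blockMap = new HashMap<>();

    public void addBlock(String bpid, long blockId, byte[] data) {
        blockMap.computeIfAbsent(bpid, k -> new HashMap<>()).put(blockId, data);
    }

    public void invalidate(String bpid, long[] invalidBlks) {
        Map<Long, byte[]> pool = blockMap.get(bpid);
        if (pool == null) {
            return;
        }
        for (long id : invalidBlks) {
            pool.remove(id);  // the reported bug: blocks were not removed
        }
    }

    public boolean contains(String bpid, long blockId) {
        Map<Long, byte[]> pool = blockMap.get(bpid);
        return pool != null && pool.containsKey(blockId);
    }

    public static void main(String[] args) {
        SimulatedDatasetSketch ds = new SimulatedDatasetSketch();
        ds.addBlock("bp-1", 1026L, new byte[0]);
        ds.invalidate("bp-1", new long[]{1026L});
        System.out.println(ds.contains("bp-1", 1026L)); // prints "false"
    }
}
```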
[jira] [Commented] (HDFS-6799) The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system.
[ https://issues.apache.org/jira/browse/HDFS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144979#comment-14144979 ] Benoy Antony commented on HDFS-6799: Thanks for letting me know [~szetszwo]. I will take a look. The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system. --- Key: HDFS-6799 URL: https://issues.apache.org/jira/browse/HDFS-6799 Project: Hadoop HDFS Issue Type: Bug Components: datanode, test Affects Versions: 2.4.1 Reporter: Megasthenis Asteris Assignee: Megasthenis Asteris Priority: Minor Fix For: 2.6.0 Attachments: HDFS-6799.patch The invalidate(String bpid, Block[] invalidBlks) method in SimulatedFSDataset.java should remove all invalidBlks from the simulated file system. It currently fails to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6799) The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system.
[ https://issues.apache.org/jira/browse/HDFS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144990#comment-14144990 ] Tsz Wo Nicholas Sze commented on HDFS-6799: --- Thanks Benoy. The patch here was correct. There might be more bugs in SimulatedFSDataset or TestBalancer.testUnknownDatanode. The fix here simply triggered them. The invalidate method in SimulatedFSDataset.java failed to remove (invalidate) blocks from the file system. --- Key: HDFS-6799 URL: https://issues.apache.org/jira/browse/HDFS-6799 Project: Hadoop HDFS Issue Type: Bug Components: datanode, test Affects Versions: 2.4.1 Reporter: Megasthenis Asteris Assignee: Megasthenis Asteris Priority: Minor Fix For: 2.6.0 Attachments: HDFS-6799.patch The invalidate(String bpid, Block[] invalidBlks) method in SimulatedFSDataset.java should remove all invalidBlks from the simulated file system. It currently fails to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145040#comment-14145040 ] Allen Wittenauer commented on HDFS-7135: This is effectively a dupe of HDFS-6956. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7132) hdfs namenode -metadataVersion command does not honor configured name dirs
[ https://issues.apache.org/jira/browse/HDFS-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145039#comment-14145039 ] Charles Lamb commented on HDFS-7132: TestEncryptionZonesWithKMS and TestPipelinesFailover passed on my local machine with the patch applied. TestWebHdfsFileSystemContract failed with and without the patch. hdfs namenode -metadataVersion command does not honor configured name dirs -- Key: HDFS-7132 URL: https://issues.apache.org/jira/browse/HDFS-7132 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7132.001.patch The hdfs namenode -metadataVersion command does not honor dfs.namenode.name.dir.nameservice.namenode configuration parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7135: --- Resolution: Duplicate Status: Resolved (was: Patch Available) Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HDFS-7104: Status: Patch Available (was: Open) Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs). When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
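The getter problem described above can be sketched in isolation (this is a hypothetical model using strings for inodes, not the real INodesInPath code): a getter should return the filtered view without trimming the backing array, so repeated calls behave identically.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical model of the HDFS-7104 issue: the getter filters out the
// null elements left by dot-snapshot components but never mutates state.
public class INodePathSketch {
    private final String[] inodes;  // may contain nulls for ".snapshot" dirs

    public INodePathSketch(String[] inodes) {
        this.inodes = inodes;
    }

    // Returns only the non-null elements; the field is left untouched,
    // so calling this twice yields the same result.
    public String[] getINodes() {
        return Arrays.stream(inodes)
            .filter(Objects::nonNull)
            .toArray(String[]::new);
    }

    // Returns the last non-null inode as a single element, matching the
    // suggestion that getLastINodeInPath should return one INode.
    public String getLastINode() {
        for (int i = inodes.length - 1; i >= 0; i--) {
            if (inodes[i] != null) {
                return inodes[i];
            }
        }
        return null;
    }
}
```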
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145043#comment-14145043 ] stack commented on HDFS-7135: - +1 LGTM. Needs nice release note. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7117) Not all datanodes are displayed on the namenode http tab
[ https://issues.apache.org/jira/browse/HDFS-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145045#comment-14145045 ] Kihwal Lee commented on HDFS-7117: -- I assume you saw this in 2.4. Have you tried branch-2 or release-2.5.*? There have been multiple fixes around the UI since 2.4. Not all datanodes are displayed on the namenode http tab Key: HDFS-7117 URL: https://issues.apache.org/jira/browse/HDFS-7117 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Jean-Baptiste Onofré Fix For: 2.4.0 On a single machine, I have three fake nodes (each node uses a different dfs.datanode.address, dfs.datanode.ipc.address, and dfs.datanode.http.address) - node1 starts the namenode and a datanode - node2 starts a datanode - node3 starts a datanode In the namenode http console, on the overview, I can see 3 live nodes: {code} http://localhost:50070/dfshealth.html#tab-overview {code} but, when clicking on the Live Nodes: {code} http://localhost:50070/dfshealth.html#tab-datanode {code} I can see only one node row. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HDFS-7104: Attachment: HDFS-7104-20140923-v1.patch [~jingzhao] Thanks very much for the clarification. I also double checked and it seems {{capacity}} is only used to eliminate dot-snapshot elements in {{inodes}}. This patch basically removed {{capacity}} as a field and made it a counter inside {{resolve}}. Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor Attachments: HDFS-7104-20140923-v1.patch inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs). When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7123) Run legacy fsimage checkpoint in parallel with PB fsimage checkpoint
[ https://issues.apache.org/jira/browse/HDFS-7123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145065#comment-14145065 ] Lohit Vijayarenu commented on HDFS-7123: If we drop the lock then users of these images will have to know that both of them could be out of sync. Parallel checkpointing takes more CPU, but at that point it is holding the big lock, so no other CPU-intensive task is going on anyway. Changes look good. +1 on the patch. Run legacy fsimage checkpoint in parallel with PB fsimage checkpoint Key: HDFS-7123 URL: https://issues.apache.org/jira/browse/HDFS-7123 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7123.patch HDFS-7097 will address the checkpoint and BR issue. In addition, it might still be useful to reduce the overall checkpoint duration, given it blocks edit log replay. If there is large volume of edit log to catch up and NN fails over, it will impact the availability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6971) Bounded staleness of EDEK caches on the NN
[ https://issues.apache.org/jira/browse/HDFS-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145094#comment-14145094 ] Zhe Zhang commented on HDFS-6971: - The {{ValueQueue}} ({{encKeyVersionQueue}}) uses an underlying cache {{keyQueues}}, which is constructed with {{expireAfterAccess}} too. I'll add a test in {{TestKMS}} to verify the time-boundedness. Bounded staleness of EDEK caches on the NN -- Key: HDFS-6971 URL: https://issues.apache.org/jira/browse/HDFS-6971 Project: Hadoop HDFS Issue Type: Sub-task Components: encryption Affects Versions: 2.5.0 Reporter: Andrew Wang Assignee: Zhe Zhang The EDEK cache on the NN can hold onto keys after the admin has rolled the key. It'd be good to time-bound the caches, perhaps also providing an explicit flush command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
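The comment above relies on Guava's {{expireAfterAccess}} behavior to bound staleness. A minimal stdlib model of that semantics (illustrative only; the real cache is the Guava-backed {{ValueQueue}}) shows what the proposed test would verify:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of expire-after-access semantics: an entry not touched
// within the TTL is dropped, bounding how stale a cached EDEK can be.
public class ExpiringCache<K, V> {
    private static final class Entry<V> {
        V value;
        long lastAccess;
        Entry(V value, long now) { this.value = value; this.lastAccess = now; }
    }

    private final long ttlNanos;
    private final Map<K, Entry<V>> map = new HashMap<>();

    public ExpiringCache(long ttlNanos) {
        this.ttlNanos = ttlNanos;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.nanoTime()));
    }

    // Expired entries are evicted on lookup; a hit refreshes the access time.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        long now = System.nanoTime();
        if (now - e.lastAccess > ttlNanos) {
            map.remove(key);
            return null;
        }
        e.lastAccess = now;
        return e.value;
    }
}
```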
[jira] [Reopened] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe reopened HDFS-7135: Not a duplicate. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145103#comment-14145103 ] Colin Patrick McCabe commented on HDFS-7135: bq. This is effectively a dupe of HDFS-6956. HDFS-6956 is about servers, this is about clients. It's not possible to do an RPC to clients to change their tracing configuration because clients normally don't expose a port and a listening service to the world. Plus clients are often short-lived, so by the time you could even do an RPC, the client might have exited. bq. Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. I sort of appreciate the simplicity of this interface, but I'm not sure how I feel about this patch. It seems more consistent to configure client-side tracing by setting a configuration key. I suppose the \-trace command added here can simply override that configuration, though. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-6957) Allow DFSClient to manually specify that a request should be traced
[ https://issues.apache.org/jira/browse/HDFS-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe resolved HDFS-6957. Resolution: Duplicate Allow DFSClient to manually specify that a request should be traced --- Key: HDFS-6957 URL: https://issues.apache.org/jira/browse/HDFS-6957 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Allow the DFSClient to manually specify that a request should be traced. One easy way to do this might be to have a configuration property that the DFSClient reads which causes it to make all its requests traced. This will allow us to more easily diagnose performance problems with a specific file or client request type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7055) Add tracing to DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145113#comment-14145113 ] Colin Patrick McCabe commented on HDFS-7055: Jenkins says that there is a new findbugs warning, but looking at: https://builds.apache.org/job/PreCommit-HDFS-Build/8150//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html It says there are 0? Meanwhile {{diffJavacWarnings.txt}} is missing, so I can't evaluate whether there is an additional warning or not. Jenkins has been frustrating lately. I will re-trigger this build. Add tracing to DFSInputStream - Key: HDFS-7055 URL: https://issues.apache.org/jira/browse/HDFS-7055 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-7055.002.patch Add tracing to DFSInputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145125#comment-14145125 ] Colin Patrick McCabe commented on HDFS-7135: [~stack], [~iwasakims]: When you get a chance, can you take a look at my patch on HDFS-7055? It adds a new configuration key, {{dfs.client.trace.sampler}}, which the HDFS client can set to control tracing. It's a little more flexible than the approach here because {{dfs.client.trace.sampler}} can be set for arbitrary clients, and it can do probabilistic sampling rather than always-on. Given that we already have ways to inject extra hadoop config keys into the shell via command-line arguments (via \-D and friends), it might be easier just to tell people to use that. Especially given that they will need other configuration to use tracing, like the location of the trace file (if they're using the file sink), and etc. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
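The {{\-D}} approach Colin describes would look something like the following. This is a hypothetical invocation: the sampler key comes from the HDFS-7055 discussion and the span-receiver path key is an assumed name from the same thread, so both may differ in the final patches.

```shell
# Hypothetical: enable client-side tracing for a single shell command by
# injecting config keys with -D, instead of a dedicated -trace command.
# Key names are taken from the discussion and may not match the final code.
hdfs dfs -Ddfs.client.trace.sampler=AlwaysSampler \
         -Dlocal-file-span-receiver.path=/tmp/htrace.out \
         -ls /
```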
[jira] [Commented] (HDFS-6988) Add configurable limit for percentage-based eviction threshold
[ https://issues.apache.org/jira/browse/HDFS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145139#comment-14145139 ] Colin Patrick McCabe commented on HDFS-6988: It seems like these configuration keys should be longs, not ints, since we might want values bigger than 2 gigabytes. This scheme seems to be getting a little too complex to easily understand. Rather than having three configuration keys (min, max, percentage), how about a single configuration key that is interpreted differently based on its value? So if it is 10% (has a percent sign at the end) we interpret it as a percentage... if it's 128MB we interpret that as the amount of space to keep free. Add configurable limit for percentage-based eviction threshold -- Key: HDFS-6988 URL: https://issues.apache.org/jira/browse/HDFS-6988 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: HDFS-6581 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: HDFS-6581 Attachments: HDFS-6988.01.patch, HDFS-6988.02.patch Per feedback from [~cmccabe] on HDFS-6930, we can make the eviction thresholds configurable. The hard-coded thresholds may not be appropriate for very large RAM disks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
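The single-key scheme proposed above can be sketched as follows. The method name and accepted unit suffixes are illustrative, not from any attached patch: a trailing percent sign means a fraction of capacity, anything else is parsed as an absolute amount of space to keep free.

```java
// Sketch of interpreting one config value two ways: "10%" as a
// percentage of capacity, "128MB"/"1GB" as an absolute byte count.
public class EvictionThreshold {
    public static long parseFreeSpace(String value, long capacityBytes) {
        value = value.trim();
        if (value.endsWith("%")) {
            double pct = Double.parseDouble(value.substring(0, value.length() - 1));
            return (long) (capacityBytes * pct / 100.0);
        }
        long multiplier = 1L;
        if (value.endsWith("GB")) {
            multiplier = 1024L * 1024L * 1024L;
            value = value.substring(0, value.length() - 2);
        } else if (value.endsWith("MB")) {
            multiplier = 1024L * 1024L;
            value = value.substring(0, value.length() - 2);
        }
        // long, not int, so values over 2 GB are representable
        return Long.parseLong(value.trim()) * multiplier;
    }
}
```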
[jira] [Comment Edited] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145144#comment-14145144 ] Allen Wittenauer edited comment on HDFS-7135 at 9/23/14 6:04 PM: - Why do we need two commands? Why can't one work for both client and server, just use different options? was (Author: aw): Why do we need two commands? Why can't both work for both client and server, just use different options? Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145144#comment-14145144 ] Allen Wittenauer commented on HDFS-7135: Why do we need two commands? Why can't both work for both client and server, just use different options? Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145152#comment-14145152 ] stack commented on HDFS-7135: - Yeah, not a duplicate. HDFS-6956 adds enable/disable/listing and configuration of server-side tracing sinks. This patch turns on tracing while a dfs command runs. Adds a simple, low-threshold means of poking at suspected problematic areas in HDFS. This patch is also nice because it hoists tracing up to be a first-class option. It could be argued that tracing doesn't yet have enough meat on it so it may not yet be ready for the spotlight, but hopefully that'll be soon addressed. bq. I suppose the -trace command added here can simply override that configuration, though. Sounds good [~cmccabe] Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Plamen Jeliazkov updated HDFS-3107: --- Attachment: HDFS-3107.patch Attaching patch with updated Javadoc. # Changed ClientProtocol JavaDoc to match what Konstantin pointed out. # Also added an @return JavaDoc for ClientProtocol. # Modified the FSDirectory unprotectedTruncate() JavaDoc and comments to remove any notion of 'schedule block for truncate'. The scheduling logic lives in FSNamesystem. More tests are necessary to show behavior when dealing with competing appends / creates. I'll include some in the next patch. HDFS truncate - Key: HDFS-3107 URL: https://issues.apache.org/jira/browse/HDFS-3107 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Lei Chang Assignee: Plamen Jeliazkov Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf Original Estimate: 1,344h Remaining Estimate: 1,344h Systems with transaction support often need to undo changes made to the underlying storage when a transaction is aborted. Currently HDFS does not support truncate (a standard Posix operation) which is a reverse operation of append, which makes upper layer applications use ugly workarounds (such as keeping track of the discarded byte range per file in a separate metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145161#comment-14145161 ] Colin Patrick McCabe commented on HDFS-7135: My worry about adding this option is that it won't be useful without setting configuration keys such as {{local-file-span-receiver.path}}. So if you have to tweak the configuration anyway, why not just tweak {{dfs.client.trace.sampler}} while you're at it? Then we don't need this command. Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7130) TestDataTransferKeepalive fails intermittently on Windows.
[ https://issues.apache.org/jira/browse/HDFS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145163#comment-14145163 ] Hadoop QA commented on HDFS-7130: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670544/HDFS-7130.1.patch against trunk revision a1fd804. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8163//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8163//console This message is automatically generated. TestDataTransferKeepalive fails intermittently on Windows. -- Key: HDFS-7130 URL: https://issues.apache.org/jira/browse/HDFS-7130 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7130.1.patch {{TestDataTransferKeepalive}} has failed intermittently on Windows. 
These tests rely on a 1 ms thread sleep to wait for a cache expiration. This is likely too short on Windows, which has been observed to have a less granular clock interrupt period compared to typical Linux machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
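The usual fix for this kind of clock-granularity flakiness is to poll for the condition with a deadline instead of a single short sleep. A stdlib sketch of such a helper follows (Hadoop's test code has a comparable utility, {{GenericTestUtils.waitFor}}; the class below is illustrative, not the actual patch):

```java
import java.util.function.BooleanSupplier;

// Poll a condition at a fixed interval until it holds or a deadline
// passes, instead of relying on a single sleep of a hard-coded duration.
public class WaitUtil {
    public static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("Timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }
}
```

This stays robust on platforms like Windows where the clock interrupt period makes millisecond-scale sleeps unreliable.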
[jira] [Commented] (HDFS-7125) Report failures during adding or removing volumes
[ https://issues.apache.org/jira/browse/HDFS-7125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145168#comment-14145168 ] Colin Patrick McCabe commented on HDFS-7125: I guess this should happen through the status-listing RPC, since reconfiguration is asynchronous. Report failures during adding or removing volumes - Key: HDFS-7125 URL: https://issues.apache.org/jira/browse/HDFS-7125 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.5.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu The details of the failures during hot swapping volumes should be reported through RPC to the user who issues the reconfiguration CLI command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7130) TestDataTransferKeepalive fails intermittently on Windows.
[ https://issues.apache.org/jira/browse/HDFS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145171#comment-14145171 ] Chris Nauroth commented on HDFS-7130: - The test failures are unrelated. TestDataTransferKeepalive fails intermittently on Windows. -- Key: HDFS-7130 URL: https://issues.apache.org/jira/browse/HDFS-7130 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7130.1.patch {{TestDataTransferKeepalive}} has failed intermittently on Windows. These tests rely on a 1 ms thread sleep to wait for a cache expiration. This is likely too short on Windows, which has been observed to have a less granular clock interrupt period compared to typical Linux machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6956) Allow dynamically changing the tracing level in Hadoop servers
[ https://issues.apache.org/jira/browse/HDFS-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145188#comment-14145188 ] Colin Patrick McCabe commented on HDFS-6956: Not sure what the issue is here. All the tests pass for me locally, and the jenkins log is puzzling. I am going to re-trigger the build. Allow dynamically changing the tracing level in Hadoop servers -- Key: HDFS-6956 URL: https://issues.apache.org/jira/browse/HDFS-6956 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-6956.002.patch, HDFS-6956.003.patch, HDFS-6956.004.patch We should allow users to dynamically change the tracing level in Hadoop servers. The easiest way to do this is probably to have an RPC accessible only to the superuser that changes tracing settings. This would allow us to turn on and off tracing on the NameNode, DataNode, etc. at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6988) Add configurable limit for percentage-based eviction threshold
[ https://issues.apache.org/jira/browse/HDFS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145202#comment-14145202 ] Arpit Agarwal commented on HDFS-6988: - Thanks for taking a look at the patch. They are integers as they are replica counts - to be multiplied by the default block length at runtime. A single default simply won't work for a range of ram disk sizes. It will force every administrator to configure one more setting. This way we have reasonable default behavior for most drive sizes, from a few GB up to 100GB. Add configurable limit for percentage-based eviction threshold -- Key: HDFS-6988 URL: https://issues.apache.org/jira/browse/HDFS-6988 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: HDFS-6581 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: HDFS-6581 Attachments: HDFS-6988.01.patch, HDFS-6988.02.patch Per feedback from [~cmccabe] on HDFS-6930, we can make the eviction thresholds configurable. The hard-coded thresholds may not be appropriate for very large RAM disks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7113) Add DFSAdmin Command to Recover Lease
[ https://issues.apache.org/jira/browse/HDFS-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145200#comment-14145200 ] Colin Patrick McCabe commented on HDFS-7113: Yeah, this is a duplicate. Add DFSAdmin Command to Recover Lease - Key: HDFS-7113 URL: https://issues.apache.org/jira/browse/HDFS-7113 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.5.0 Reporter: Miklos Christine Priority: Minor Attachments: HDFS-7113.2.patch, HDFS-7113.patch In certain conditions, a lease may be left around if an error occurs while writing to HDFS and the file is left open. Having a DFSAdmin command would allow administrators to recover the lease and close the file easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7055) Add tracing to DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145221#comment-14145221 ] stack commented on HDFS-7055: - bq. One thing to keep in mind here is that if you call Trace.startSpan with Sampler.NEVER, and there is an existing thread trace span, a subspan will always be created. Thanks for mentioning this up front... first thing I stumbled on looking in code. It's a little confusing, and having to add a comment explaining NEVER at every span open will get annoying fast. Nit: This exception, if it is possible to ask trace for the list of options, should list the possible options (I can see folks typing in sampler with wrong case or missing a piece... listing the possible options will let them quickly see what they have done wrong): + throw new RuntimeException("Can't create sampler " + samplerStr); Nit: Should we have a convention for naming spans [~cmccabe]? For example, method name followed by arg types all in camel case? +dfsClient.getTraceScope("byteBufferRead", src); ... would become readByteBuffer and + dfsClient.getTraceScope("byteArrayRead", src); would be readByteArrayIntInt? Patch looks great to me. You gotten any spans out of it? I can try it if you'd like, no problem. Add tracing to DFSInputStream - Key: HDFS-7055 URL: https://issues.apache.org/jira/browse/HDFS-7055 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-7055.002.patch Add tracing to DFSInputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
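Stack's nit about the sampler exception can be sketched like this (the sampler names below are assumptions for illustration, not the set the patch actually accepts): when the name doesn't match, list the valid options in the message so a case typo is obvious.

```java
// Sketch of a sampler factory whose error message enumerates the valid
// options, per the review nit above. The KNOWN list is illustrative.
public class SamplerFactory {
    private static final String[] KNOWN =
        {"AlwaysSampler", "NeverSampler", "ProbabilitySampler"};

    public static String create(String samplerStr) {
        for (String s : KNOWN) {
            if (s.equalsIgnoreCase(samplerStr)) {
                return s;  // tolerate case mismatches
            }
        }
        throw new RuntimeException("Can't create sampler " + samplerStr
            + "; valid samplers are: " + String.join(", ", KNOWN));
    }
}
```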
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145224#comment-14145224 ] stack commented on HDFS-7135: - bq. So if you have to tweak the configuration anyway, why not just tweak dfs.client.trace.sampler while you're at it? Then we don't need this command. Sounds reasonable to me [~cmccabe] Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7125) Report failures during adding or removing volumes
[ https://issues.apache.org/jira/browse/HDFS-7125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145235#comment-14145235 ] Lei (Eddy) Xu commented on HDFS-7125: - [~cmccabe] Yes, this is for refining the information and formats from {{-reconfig status}} command. Report failures during adding or removing volumes - Key: HDFS-7125 URL: https://issues.apache.org/jira/browse/HDFS-7125 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.5.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu The details of the failures during hot swapping volumes should be reported through RPC to the user who issues the reconfiguration CLI command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6917) Add an hdfs debug command to validate blocks, call recoverlease, etc.
[ https://issues.apache.org/jira/browse/HDFS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145250#comment-14145250 ] Colin Patrick McCabe commented on HDFS-6917: bq. Can clean up the imports in DebugAdmin, I can tell where it was copy pasted from ok bq. Missing some indentation in the super() calls in each command I wanted to keep it this way so that what was shown in the source file corresponded to what was displayed on the command-line. At the same time, I didn't want to exceed 79 columns as per our coding standard. You can see the dilemma here... if I use normal indentation, I either have to accept less than 79 columns for the command-line output, or exceed the limit for the source code. bq. Need tests  Can we do this in a follow-up? bq. Hardcoding 7 is okay, but slightly better would be 2 + DataChecksum.HEADER_LEN. ok bq. CHECKSUMS_PER_BUF seems kinda large. With 512B per checksum, we're allocating a 64MB data buffer. I figure 8MB would be enough to still get good disk perf. ok bq. metaDidRead is unused removed bq. Could print the current retry count when sleeping/looping ok bq. I expected the default # of retries to be 0, so the command by default tries to do a single recoverLease ok Add an hdfs debug command to validate blocks, call recoverlease, etc. - Key: HDFS-6917 URL: https://issues.apache.org/jira/browse/HDFS-6917 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-6917.001.patch, HDFS-6917.002.patch HDFS should have a debug command which could validate HDFS block files, call recoverLease, and have some other functionality. These commands would be purely for debugging and would appear under a separate command hierarchy inside the hdfs command. There would be no guarantee of API stability for these commands and the debug submenu would not be listed by just typing the hdfs command. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
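The buffer-sizing point in the review above (512 B of data per checksum, so the CHECKSUMS_PER_BUF choice determines the data-buffer size) can be checked with a little arithmetic. The constant values here are inferred from the comment (64 MB originally, 8 MB proposed) and are assumptions, not the patch's actual constants.

```java
public class ChecksumBufSizing {
    static final int BYTES_PER_CHECKSUM = 512; // bytes of data covered by one checksum

    // Data-buffer bytes implied by a given CHECKSUMS_PER_BUF value.
    static long dataBufferBytes(int checksumsPerBuf) {
        return (long) checksumsPerBuf * BYTES_PER_CHECKSUM;
    }

    public static void main(String[] args) {
        int original = 128 * 1024; // 131072 checksums -> 64 MB data buffer
        int proposed = 16 * 1024;  // 16384 checksums  -> 8 MB data buffer
        System.out.println(dataBufferBytes(original)); // 67108864
        System.out.println(dataBufferBytes(proposed)); // 8388608
    }
}
```

An 8 MB buffer is still large enough for sequential disk reads while allocating one eighth of the memory per invocation.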
[jira] [Updated] (HDFS-6917) Add an hdfs debug command to validate blocks, call recoverlease, etc.
[ https://issues.apache.org/jira/browse/HDFS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-6917: --- Attachment: HDFS-6917.003.patch Add an hdfs debug command to validate blocks, call recoverlease, etc. - Key: HDFS-6917 URL: https://issues.apache.org/jira/browse/HDFS-6917 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-6917.001.patch, HDFS-6917.002.patch, HDFS-6917.003.patch HDFS should have a debug command which could validate HDFS block files, call recoverLease, and have some other functionality. These commands would be purely for debugging and would appear under a separate command hierarchy inside the hdfs command. There would be no guarantee of API stability for these commands and the debug submenu would not be listed by just typing the hdfs command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7129) Metrics to track usage of memory for writes
[ https://issues.apache.org/jira/browse/HDFS-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-7129: Description: A few metrics to evaluate feature usage and suggest improvements. Thanks to [~sureshms] for some of these suggestions. # Number of times a block in memory was read (before being ejected) # Average block size for data written to memory tier # Time the block was in memory before being ejected # Number of blocks written to memory # Number of memory writes requested but not satisfied (failed-over to disk) # Number of blocks evicted without ever being read from memory # Average delay between memory write and disk write (window where a node restart could cause data loss). # Replicas written to disk by lazy writer # Bytes written to disk by lazy writer # Replicas deleted by application before being persisted to disk was: A few metrics to evaluate feature usage and suggest improvements. Thanks to [~sureshms] for some of these suggestions. # Number of times a block in memory was read (before being ejected) # Average block size for data written to memory tier # Time the block was in memory before being ejected # Number of blocks written to memory # Number of memory writes requested but not satisfied (failed-over to disk) # Number of blocks evicted without ever being read from memory # Average delay between memory write and disk write (window where a node restart could cause data loss). Metrics to track usage of memory for writes --- Key: HDFS-7129 URL: https://issues.apache.org/jira/browse/HDFS-7129 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: HDFS-6581 Reporter: Arpit Agarwal A few metrics to evaluate feature usage and suggest improvements. Thanks to [~sureshms] for some of these suggestions. 
# Number of times a block in memory was read (before being ejected) # Average block size for data written to memory tier # Time the block was in memory before being ejected # Number of blocks written to memory # Number of memory writes requested but not satisfied (failed-over to disk) # Number of blocks evicted without ever being read from memory # Average delay between memory write and disk write (window where a node restart could cause data loss). # Replicas written to disk by lazy writer # Bytes written to disk by lazy writer # Replicas deleted by application before being persisted to disk -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145285#comment-14145285 ] Hadoop QA commented on HDFS-7104: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670747/HDFS-7104-20140923-v1.patch against trunk revision a1fd804. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestSnapshotCommands org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8165//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8165//console This message is automatically generated. 
Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor Attachments: HDFS-7104-20140923-v1.patch inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs). When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7135) Add trace command to FsShell
[ https://issues.apache.org/jira/browse/HDFS-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145284#comment-14145284 ] Hadoop QA commented on HDFS-7135: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670725/HDFS-7135-1.patch against trunk revision a1fd804. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.crypto.random.TestOsSecureRandom org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.tracing.TestTracing org.apache.hadoop.hdfs.TestDatanodeBlockScanner org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8164//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8164//console This message is automatically generated. 
Add trace command to FsShell Key: HDFS-7135 URL: https://issues.apache.org/jira/browse/HDFS-7135 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: HDFS-7135-0.patch, HDFS-7135-1.patch Adding manual tracing command which can be used like 'hdfs dfs -trace -ls /'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7128) Decommission slows way down when it gets towards the end
[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145289#comment-14145289 ] Ming Ma commented on HDFS-7128: --- Thanks, Kihwal. After we finish some cluster level testing, I will update the patch with some unit tests and address your comment. Regarding dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, my understanding is the current values are ok, based on block size, heartbeat interval, and DN balance bandwidth. If we increase them, it might not help with performance. For example, say block size is 512MB, heartbeat interval is 6s, and DN balance bandwidth is 40MB/s; then it takes a minimum of around 12.8s to replicate one block. DN heartbeat is frequent enough to get the next items in the queue to maintain the max throughput. We can do some evaluation on this. Decommission slows way down when it gets towards the end Key: HDFS-7128 URL: https://issues.apache.org/jira/browse/HDFS-7128 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma Assignee: Ming Ma Attachments: HDFS-7128.patch When we decommission nodes across different racks, the decommission process becomes really slow at the end, hardly making any progress. The problem is some blocks are on 3 decomm-in-progress DNs and the way replications are scheduled causes unnecessary delay. Here is the analysis. When BlockManager schedules the replication work from neededReplication, it first needs to pick the source node for replication via chooseSourceDatanode. The core policies to pick the source node are: 1. Prefer decomm-in-progress node. 2. Only pick the nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams or dfs.namenode.replication.max-streams-hard-limit, based on the replication priority. When we decommission nodes, 1. All the decommission nodes' blocks will be added to neededReplication. 2.
BM will pick X number of blocks from neededReplication in each iteration. X is based on cluster size and some configurable multiplier. So if the cluster has 2000 nodes, X will be around 4000. 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen as the source node for all these 4000 blocks. The reason the outstanding replication thresholds don't kick in is due to the implementation of BlockManager.computeReplicationWorkForBlocks; node.getNumberOfBlocksToBeReplicated() remains zero given node.addBlockToBeReplicated is called after the source node iteration. {noformat} ... synchronized (neededReplications) { for (int priority = 0; priority < blocksToReplicate.size(); priority++) { ... chooseSourceDatanode ... } for (ReplicationWork rw : work) { ... rw.srcNode.addBlockToBeReplicated(block, targets); ... } {noformat} 4. So several decomm-in-progress nodes A, B, and C each end up with node.getNumberOfBlocksToBeReplicated() around 4000. 5. If we assume each node can replicate 5 blocks per minute, it is going to take 800 minutes to finish replication of these blocks. 6. The pending replication timeout kicks in after 5 minutes. The items will be removed from the pending replication queue and added back to neededReplication. The replications will then be handled by other source nodes of these blocks. But the blocks still remain in nodes A, B, and C's pending replication queue, DatanodeDescriptor.replicateBlocks, so A, B, and C continue the replications of these blocks, although these blocks might already have been replicated by other DNs after the replication timeout. 7. Some blocks' replicas exist on A, B, and C, and such a block is at the end of A's pending replication queue. Even after the block's replication times out, no source node can be chosen given A, B, and C all have high pending replication counts. So we have to wait until A drains its pending replication queue. Meanwhile, the items in A's pending replication queue have been taken care of by other nodes and are no longer under-replicated.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
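The back-of-envelope numbers in the two comments above can be reproduced directly. The inputs (512 MB blocks, 40 MB/s balance bandwidth, a 4000-block queue, 5 blocks replicated per minute) are the assumed figures from the discussion, not measured values.

```java
public class DecommissionEstimate {
    // Minimum seconds to replicate one block at the given balance bandwidth.
    static double secondsPerBlock(double blockMB, double bandwidthMBps) {
        return blockMB / bandwidthMBps;
    }

    // Minutes to drain a per-node pending replication queue.
    static int minutesToDrain(int queuedBlocks, int blocksPerMinute) {
        return queuedBlocks / blocksPerMinute;
    }

    public static void main(String[] args) {
        System.out.println(secondsPerBlock(512, 40));  // 12.8 s per block
        System.out.println(minutesToDrain(4000, 5));   // 800 minutes
    }
}
```

The 800-minute figure is why a 5-minute pending-replication timeout fires thousands of times before node A's queue drains.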
[jira] [Commented] (HDFS-6894) Add XDR parser method for each NFS response
[ https://issues.apache.org/jira/browse/HDFS-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145370#comment-14145370 ] Hadoop QA commented on HDFS-6894: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670129/HDFS-6894.001.patch against trunk revision 3dc28e2. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8169//console This message is automatically generated. Add XDR parser method for each NFS response --- Key: HDFS-6894 URL: https://issues.apache.org/jira/browse/HDFS-6894 Project: Hadoop HDFS Issue Type: Sub-task Components: nfs Reporter: Brandon Li Assignee: Brandon Li Attachments: HDFS-6894.001.patch This can be an abstract method in NFS3Response to force the subclasses to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7055) Add tracing to DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145389#comment-14145389 ] Hadoop QA commented on HDFS-7055: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670363/HDFS-7055.002.patch against trunk revision 5338ac4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1264 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8166//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8166//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8166//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8166//console This message is automatically generated. Add tracing to DFSInputStream - Key: HDFS-7055 URL: https://issues.apache.org/jira/browse/HDFS-7055 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-7055.002.patch Add tracing to DFSInputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7035) Make adding volume an atomic operation.
[ https://issues.apache.org/jira/browse/HDFS-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7035: Summary: Make adding volume an atomic operation. (was: Refactor DataStorage and BlockSlicePoolStorage ) Make adding volume an atomic operation. --- Key: HDFS-7035 URL: https://issues.apache.org/jira/browse/HDFS-7035 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.5.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Attachments: HDFS-7035.000.combo.patch, HDFS-7035.000.patch, HDFS-7035.001.combo.patch, HDFS-7035.001.patch, HDFS-7035.002.patch, HDFS-7035.003.patch, HDFS-7035.003.patch, HDFS-7035.004.patch {{DataStorage}} and {{BlockPoolSliceStorage}} share many similar code paths. This jira extracts the common part of these two classes to simplify the logic for both. This is the ground work for handling partial failures during hot swapping volumes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7035) Refactor DataStorage and BlockSlicePoolStorage
[ https://issues.apache.org/jira/browse/HDFS-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7035: Attachment: HDFS-7035.004.patch Make addVolume() an atomic operation. The volume metadata in {{DataStorage}} and {{FsDataset}} is first loaded into a local copy. After all I/O finishes, if nothing has failed, the {{DataNode}} commits the loaded volume metadata to {{DataStorage}} and {{FsDataset}} respectively. Therefore, if any error happens while loading a volume, the metadata belonging to this volume will not be visible to the service. It also captures the error message for {{IOExceptions}} in {{DataStorage#removeVolumes()}}. Refactor DataStorage and BlockSlicePoolStorage --- Key: HDFS-7035 URL: https://issues.apache.org/jira/browse/HDFS-7035 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.5.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Attachments: HDFS-7035.000.combo.patch, HDFS-7035.000.patch, HDFS-7035.001.combo.patch, HDFS-7035.001.patch, HDFS-7035.002.patch, HDFS-7035.003.patch, HDFS-7035.003.patch, HDFS-7035.004.patch {{DataStorage}} and {{BlockPoolSliceStorage}} share many similar code paths. This jira extracts the common part of these two classes to simplify the logic for both. This is the ground work for handling partial failures during hot swapping volumes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
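The load-then-commit pattern described in the update above can be sketched as follows. This is a minimal standalone illustration, not the patch: the class, the `String` stand-in for volume metadata, and the helper method are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AtomicAddVolume {
    private final List<String> volumes = new ArrayList<>(); // committed, service-visible metadata

    public void addVolume(String dir) throws IOException {
        // All failure-prone I/O happens against a local copy first...
        List<String> staged = new ArrayList<>();
        staged.add(loadStorageDirectory(dir)); // may throw; nothing is visible yet
        // ...and the result becomes visible only after everything succeeded.
        volumes.addAll(staged);
    }

    public List<String> getVolumes() {
        return new ArrayList<>(volumes);
    }

    // Stand-in for the real directory-loading I/O.
    private String loadStorageDirectory(String dir) throws IOException {
        if (dir == null || dir.isEmpty()) {
            throw new IOException("failed to load volume " + dir);
        }
        return dir;
    }
}
```

If `loadStorageDirectory` throws, the committed list is untouched, which is exactly the atomicity property the update describes.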
[jira] [Updated] (HDFS-7035) Make adding volume an atomic operation.
[ https://issues.apache.org/jira/browse/HDFS-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7035: Description: It refactors {{DataStorage}} and {{BlockPoolSliceStorage}} to reduce the duplicate code and supports atomic adding volume operations. (was: {{DataStorage}} and {{BlockPoolSliceStorage}} share many similar code path. This jira extracts the common part of these two classes to simplify the logic for both. This is the ground work for handling partial failures during hot swapping volumes.) Make adding volume an atomic operation. --- Key: HDFS-7035 URL: https://issues.apache.org/jira/browse/HDFS-7035 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 2.5.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Attachments: HDFS-7035.000.combo.patch, HDFS-7035.000.patch, HDFS-7035.001.combo.patch, HDFS-7035.001.patch, HDFS-7035.002.patch, HDFS-7035.003.patch, HDFS-7035.003.patch, HDFS-7035.004.patch It refactors {{DataStorage}} and {{BlockPoolSliceStorage}} to reduce the duplicate code and supports atomic adding volume operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6917) Add an hdfs debug command to validate blocks, call recoverlease, etc.
[ https://issues.apache.org/jira/browse/HDFS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145438#comment-14145438 ] Hadoop QA commented on HDFS-6917: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670775/HDFS-6917.003.patch against trunk revision 3dc28e2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8168//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8168//console This message is automatically generated. Add an hdfs debug command to validate blocks, call recoverlease, etc. 
- Key: HDFS-6917 URL: https://issues.apache.org/jira/browse/HDFS-6917 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-6917.001.patch, HDFS-6917.002.patch, HDFS-6917.003.patch HDFS should have a debug command which could validate HDFS block files, call recoverLease, and have some other functionality. These commands would be purely for debugging and would appear under a separate command hierarchy inside the hdfs command. There would be no guarantee of API stability for these commands and the debug submenu would not be listed by just typing the hdfs command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7132) hdfs namenode -metadataVersion command does not honor configured name dirs
[ https://issues.apache.org/jira/browse/HDFS-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-7132: -- Resolution: Fixed Fix Version/s: 2.6.0 Status: Resolved (was: Patch Available) +1 LGTM, committed to trunk and branch-2. Thanks Charles. hdfs namenode -metadataVersion command does not honor configured name dirs -- Key: HDFS-7132 URL: https://issues.apache.org/jira/browse/HDFS-7132 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7132.001.patch The hdfs namenode -metadataVersion command does not honor dfs.namenode.name.dir.nameservice.namenode configuration parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7132) hdfs namenode -metadataVersion command does not honor configured name dirs
[ https://issues.apache.org/jira/browse/HDFS-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-7132: -- Issue Type: Bug (was: Sub-task) Parent: (was: HDFS-6891) hdfs namenode -metadataVersion command does not honor configured name dirs -- Key: HDFS-7132 URL: https://issues.apache.org/jira/browse/HDFS-7132 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7132.001.patch The hdfs namenode -metadataVersion command does not honor dfs.namenode.name.dir.nameservice.namenode configuration parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-316) Balancer should run for a configurable # of iterations
[ https://issues.apache.org/jira/browse/HDFS-316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145456#comment-14145456 ] Allen Wittenauer commented on HDFS-316: --- Please don't use camel case options. I know the rest of the system does, but they are extremely user unfriendly and something we should start actively avoiding. Balancer should run for a configurable # of iterations -- Key: HDFS-316 URL: https://issues.apache.org/jira/browse/HDFS-316 Project: Hadoop HDFS Issue Type: Improvement Components: balancer Affects Versions: 2.4.1 Reporter: Brian Bockelman Assignee: Xiaoyu Yao Priority: Minor Labels: newbie Attachments: HDFS-316.0.patch The balancer currently exits if nothing has changed after 5 iterations. Our site would like to constantly balance a stream of incoming data; we would like to be able to set the number of iterations it does nothing for before exiting; even better would be if we set it to a negative number and could continuously run this as a daemon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6932) Balancer and Mover tools should ignore replicas on RAM_DISK
[ https://issues.apache.org/jira/browse/HDFS-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoyu Yao updated HDFS-6932: - Attachment: HDFS-6932.1.patch Attach a patch to skip move/balance to/from transient storage and add unit tests. Balancer and Mover tools should ignore replicas on RAM_DISK --- Key: HDFS-6932 URL: https://issues.apache.org/jira/browse/HDFS-6932 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: HDFS-6581 Reporter: Arpit Agarwal Assignee: Xiaoyu Yao Attachments: HDFS-6932.0.patch, HDFS-6932.1.patch Per title, balancer and mover should just ignore replicas on RAM disk instead of attempting to move them to other nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145463#comment-14145463 ] Zhe Zhang commented on HDFS-7104: - {{TestSnapshotCommands}} failure reveals that {{FSDirectory#isDir()}} also relies on {{getLastINodeInPath()}} to return a null inode to indicate the path (e.g. /dir/.snapshot) is _not_ a directory. So I don't think we can simply eliminate null elements in {{resolve()}}. I think we need to keep {{resolve()}} as is and rename {{getINodes()}} to {{getNonNullINodes()}}, while making it a real getter (without changing the {{inodes}} array). It might also be useful to write another {{getAllINodes()}} method to return both null and non-null inodes. Thoughts? Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor Attachments: HDFS-7104-20140923-v1.patch inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs). When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
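The getter split proposed in the comment above can be sketched like this. The class here is a hypothetical stand-in (a `String[]` in place of `INode[]`), not the real INodesInPath API; it only illustrates a getter that filters without trimming the backing array.

```java
import java.util.Arrays;
import java.util.Objects;

public class INodePathView {
    // Stand-in for INode[]; null entries correspond to dot-snapshot components.
    private final String[] inodes;

    INodePathView(String[] inodes) { this.inodes = inodes; }

    // A real getter: returns a filtered copy and never trims `inodes` itself,
    // so repeated calls see the same state.
    String[] getNonNullINodes() {
        return Arrays.stream(inodes).filter(Objects::nonNull).toArray(String[]::new);
    }

    // Returns both null and non-null elements, also without mutation.
    String[] getAllINodes() {
        return inodes.clone();
    }
}
```

Because neither method mutates the array, the behavior no longer depends on which getter was called first, which is the bug described in the issue.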
[jira] [Commented] (HDFS-7130) TestDataTransferKeepalive fails intermittently on Windows.
[ https://issues.apache.org/jira/browse/HDFS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145464#comment-14145464 ] Jing Zhao commented on HDFS-7130: - The patch looks good to me. +1 TestDataTransferKeepalive fails intermittently on Windows. -- Key: HDFS-7130 URL: https://issues.apache.org/jira/browse/HDFS-7130 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7130.1.patch {{TestDataTransferKeepalive}} has failed intermittently on Windows. These tests rely on a 1 ms thread sleep to wait for a cache expiration. This is likely too short on Windows, which has been observed to have a less granular clock interrupt period compared to typical Linux machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
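One common way to make such tests robust to a coarse clock interrupt period is to poll for the expected condition up to a deadline instead of relying on a fixed short sleep. Hadoop's GenericTestUtils has a similar waitFor helper; the standalone version below is a hedged sketch, not the actual patch.

```java
public class WaitFor {
    interface Condition { boolean holds(); }

    // Poll `c` every `checkEveryMs` until it holds or `timeoutMs` elapses.
    static boolean waitFor(Condition c, long checkEveryMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!c.holds()) {
            if (System.currentTimeMillis() > deadline) {
                return false; // timed out; the caller decides how to fail the test
            }
            Thread.sleep(checkEveryMs);
        }
        return true;
    }
}
```

A deadline loop only slows the test down when the condition is genuinely late, rather than failing whenever a 1 ms sleep stretches to a full 15 ms timer tick.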
[jira] [Updated] (HDFS-6888) Remove audit logging of getFIleInfo()
[ https://issues.apache.org/jira/browse/HDFS-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated HDFS-6888: -- Attachment: HDFS-6888-4.patch Patch updated. Remove audit logging of getFIleInfo() - Key: HDFS-6888 URL: https://issues.apache.org/jira/browse/HDFS-6888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.5.0 Reporter: Kihwal Lee Assignee: Chen He Labels: log Attachments: HDFS-6888-2.patch, HDFS-6888-3.patch, HDFS-6888-4.patch, HDFS-6888.patch The audit logging of getFileInfo() was added in HDFS-3733. Since this is one of the most frequently called methods, users have noticed that the audit log is now filled with it. Since we now have HTTP request logging, this seems unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Plamen Jeliazkov updated HDFS-3107: --- Status: Patch Available (was: Open) HDFS truncate - Key: HDFS-3107 URL: https://issues.apache.org/jira/browse/HDFS-3107 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Lei Chang Assignee: Plamen Jeliazkov Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf Original Estimate: 1,344h Remaining Estimate: 1,344h Systems with transaction support often need to undo changes made to the underlying storage when a transaction is aborted. Currently HDFS does not support truncate (a standard Posix operation), the reverse operation of append. This forces upper-layer applications to use ugly workarounds (such as keeping track of the discarded byte range per file in a separate metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
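For reference, the Posix-style semantics the issue asks HDFS to support are what local file systems already provide: shrinking a file discards everything past the new length. A minimal illustration using the standard Java API (not HDFS code):

```java
// Illustration of Posix-style truncate semantics on a local file,
// using java.io.RandomAccessFile#setLength. Not HDFS code.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class TruncateDemo {
    /** Truncates the file to newLength bytes and returns its remaining content. */
    static String truncate(Path file, long newLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(newLength); // discard everything past newLength
        }
        return Files.readString(file);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("truncate", ".dat");
        Files.write(f, "hello world".getBytes());
        System.out.println(truncate(f, 5)); // hello
        Files.delete(f);
    }
}
```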
[jira] [Commented] (HDFS-6956) Allow dynamically changing the tracing level in Hadoop servers
[ https://issues.apache.org/jira/browse/HDFS-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145487#comment-14145487 ] Hadoop QA commented on HDFS-6956: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670549/HDFS-6956.004.patch against trunk revision 5338ac4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8167//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8167//console This message is automatically generated. 
Allow dynamically changing the tracing level in Hadoop servers -- Key: HDFS-6956 URL: https://issues.apache.org/jira/browse/HDFS-6956 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-6956.002.patch, HDFS-6956.003.patch, HDFS-6956.004.patch We should allow users to dynamically change the tracing level in Hadoop servers. The easiest way to do this is probably to have an RPC accessible only to the superuser that changes tracing settings. This would allow us to turn on and off tracing on the NameNode, DataNode, etc. at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
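The proposal above is an RPC, accessible only to the superuser, that changes tracing settings at runtime. A minimal sketch of that access-control idea follows; the class name, level constants, and the superuser check are all illustrative, not the actual Hadoop API:

```java
// Hypothetical sketch of a superuser-gated runtime tracing toggle.
// TraceAdmin, the level constants, and SUPERUSER are illustrative names.
import java.util.concurrent.atomic.AtomicInteger;

public class TraceAdmin {
    public static final int OFF = 0, SAMPLED = 1, ALWAYS = 2;
    private static final String SUPERUSER = "hdfs";
    // Atomic so concurrent RPC handler threads see a consistent level.
    private final AtomicInteger level = new AtomicInteger(OFF);

    /** RPC entry point: only the superuser may change the tracing level. */
    public void setTraceLevel(String caller, int newLevel) {
        if (!SUPERUSER.equals(caller)) {
            throw new SecurityException("only the superuser may change tracing");
        }
        level.set(newLevel);
    }

    public int getTraceLevel() {
        return level.get();
    }

    public static void main(String[] args) {
        TraceAdmin admin = new TraceAdmin();
        admin.setTraceLevel("hdfs", ALWAYS);
        System.out.println(admin.getTraceLevel()); // 2
    }
}
```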
[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145497#comment-14145497 ] Hadoop QA commented on HDFS-3107: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670760/HDFS-3107.patch against trunk revision f48686a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8172//console This message is automatically generated. HDFS truncate - Key: HDFS-3107 URL: https://issues.apache.org/jira/browse/HDFS-3107 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Lei Chang Assignee: Plamen Jeliazkov Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf Original Estimate: 1,344h Remaining Estimate: 1,344h Systems with transaction support often need to undo changes made to the underlying storage when a transaction is aborted. Currently HDFS does not support truncate (a standard Posix operation), the reverse operation of append. This forces upper-layer applications to use ugly workarounds (such as keeping track of the discarded byte range per file in a separate metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7130) TestDataTransferKeepalive fails intermittently on Windows.
[ https://issues.apache.org/jira/browse/HDFS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7130: Resolution: Fixed Fix Version/s: 2.6.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thank you for the code review, Jing. I committed this to trunk and branch-2. TestDataTransferKeepalive fails intermittently on Windows. -- Key: HDFS-7130 URL: https://issues.apache.org/jira/browse/HDFS-7130 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 2.6.0 Attachments: HDFS-7130.1.patch {{TestDataTransferKeepalive}} has failed intermittently on Windows. These tests rely on a 1 ms thread sleep to wait for a cache expiration. This is likely too short on Windows, which has been observed to have a less granular clock interrupt period compared to typical Linux machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HDFS-7104: Attachment: HDFS-7104-20140923-v2.patch This patch reflects the above comment on refactoring {{getINodes()}}. Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor Attachments: HDFS-7104-20140923-v1.patch, HDFS-7104-20140923-v2.patch inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs). When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7132) hdfs namenode -metadataVersion command does not honor configured name dirs
[ https://issues.apache.org/jira/browse/HDFS-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145528#comment-14145528 ] Charles Lamb commented on HDFS-7132: Thanks for the review [~andrew.wang]. hdfs namenode -metadataVersion command does not honor configured name dirs -- Key: HDFS-7132 URL: https://issues.apache.org/jira/browse/HDFS-7132 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Fix For: 2.6.0 Attachments: HDFS-7132.001.patch The hdfs namenode -metadataVersion command does not honor dfs.namenode.name.dir.nameservice.namenode configuration parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
[ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145535#comment-14145535 ] Colin Patrick McCabe commented on HDFS-7121: I think it's probably good enough to just check if all JournalNodes are present before sending out the {{doPreUpgrade}} message. This guards against the administrative misconfiguration case, or the case where one or more journal nodes are down. It's true that we could experience a failure in between that check and the pre-upgrade operation, but the chances of that happening are very low. If it does happen, it will simply result in a JN being dropped out of the quorum later, which monitoring tools will pick up, and admins will fix. I'm pretty sure that there isn't a complete solution to this problem because it can be reduced to the Two Generals Problem. For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node. -- Key: HDFS-7121 URL: https://issues.apache.org/jira/browse/HDFS-7121 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node Reporter: Chris Nauroth Several JournalNode operations are not satisfied by a quorum. They must succeed on every JournalNode in the cluster. If the operation succeeds on some nodes, but fails on others, then this may leave the nodes in an inconsistent state and require operations to do manual recovery steps. For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then the operator will need to correct the problem on the failed node and also manually restore the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
[ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145556#comment-14145556 ] Chris Nauroth commented on HDFS-7121: - bq. I think it's probably good enough to just check if all JournalNodes are present before sending out the doPreUpgrade message. Hi Colin. This is coming out of a production support issue in which some invalid file system permissions caused the rename from current to previous.tmp to fail on 1 out of 3 JournalNodes. There weren't any nodes down. A pre-check like you suggested wouldn't have helped protect against this, because the failure wouldn't show up until actually attempting to do the work. For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node. -- Key: HDFS-7121 URL: https://issues.apache.org/jira/browse/HDFS-7121 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node Reporter: Chris Nauroth Several JournalNode operations are not satisfied by a quorum. They must succeed on every JournalNode in the cluster. If the operation succeeds on some nodes, but fails on others, then this may leave the nodes in an inconsistent state and require operations to do manual recovery steps. For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then the operator will need to correct the problem on the failed node and also manually restore the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
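The pre-check Colin proposes (verify every JournalNode is reachable before sending {{doPreUpgrade}}) can be sketched as below. As the discussion notes, this only narrows the failure window, it cannot close it, and it would not have caught the permissions failure Chris describes. All names here are illustrative, not the actual JournalNode API:

```java
// Hypothetical sketch of the "check all JournalNodes first" idea.
// The liveness probe is abstracted as a Predicate; real code would make
// an RPC to each JournalNode. Illustrative names throughout.
import java.util.List;
import java.util.function.Predicate;

public class PreUpgradeCheck {
    /** Returns true only if every journal node passes the liveness probe. */
    public static boolean allPresent(List<String> journalNodes,
                                     Predicate<String> isAlive) {
        return journalNodes.stream().allMatch(isAlive);
    }

    public static void main(String[] args) {
        List<String> jns = List.of("jn1:8485", "jn2:8485", "jn3:8485");
        // Simulate jn2 being unreachable: the pre-check fails and
        // doPreUpgrade would not be sent to any node.
        System.out.println(allPresent(jns, jn -> !jn.equals("jn2:8485")));
    }
}
```

Failures that occur during the operation itself (like the permissions problem above) still require either the undo mechanism this JIRA proposes or manual recovery.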
[jira] [Created] (HDFS-7136) libhdfs doesn't compile on OS X
Allen Wittenauer created HDFS-7136: -- Summary: libhdfs doesn't compile on OS X Key: HDFS-7136 URL: https://issues.apache.org/jira/browse/HDFS-7136 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer vecsum uses clock_gettime which isn't supported on OS X. Like Windows, we just need to ignore that bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7136) libhdfs doesn't compile on OS X
[ https://issues.apache.org/jira/browse/HDFS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7136: --- Attachment: HDFS-7136.patch -00: Change the cmakefile to only build vecsum if the clock_gettime routine exists. libhdfs doesn't compile on OS X --- Key: HDFS-7136 URL: https://issues.apache.org/jira/browse/HDFS-7136 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer Attachments: HDFS-7136.patch vecsum uses clock_gettime which isn't supported on OS X. Like Windows, we just need to ignore that bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7136) libhdfs fails compile step on OS X
[ https://issues.apache.org/jira/browse/HDFS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7136: --- Summary: libhdfs fails compile step on OS X (was: libhdfs doesn't compile on OS X) libhdfs fails compile step on OS X -- Key: HDFS-7136 URL: https://issues.apache.org/jira/browse/HDFS-7136 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer Attachments: HDFS-7136.patch vecsum uses clock_gettime which isn't supported on OS X. Like Windows, we just need to ignore that bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7104) Fix and clarify INodeInPath getter functions
[ https://issues.apache.org/jira/browse/HDFS-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145567#comment-14145567 ] Jing Zhao commented on HDFS-7104: - Thanks for the analysis, [~zhz]. For all the callers of {{getINodes}}, we have the following two cases: # The path is a non-snapshot path. In this case {{getINodes}} returns inodes directly, which includes null elements. # The path is a snapshot path (including paths ending with dot-snapshot). In this case, the current {{getINodes}} trims elements to make the length of {{inodes}} equal to the value of {{capacity}}. Note that in this case null elements may still be contained in {{inodes}} (otherwise we cannot identify non-existing files/directories in snapshots). So I can see three options here. The first option is to keep the current {{getINodes}} unchanged, but add more javadoc to explain the logic behind it. The second option is to make it a real getter, but we cannot rename it to {{getNonNullINodes}} since null elements can still be included. The third option is to also create an extra method {{INodesInPath#getINodesForWrite}} for the above case #1, which first does a sanity check to make sure {{capacity == inodes.length}}, and then returns {{inodes}} directly. This method can be called by write ops like mkdir, concat, delete, etc. Since we do not usually call {{getINodes}} multiple times for the same INodesInPath instance, I think we may consider starting from option 2. Fix and clarify INodeInPath getter functions Key: HDFS-7104 URL: https://issues.apache.org/jira/browse/HDFS-7104 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor Attachments: HDFS-7104-20140923-v1.patch, HDFS-7104-20140923-v2.patch inodes is initialized with the number of path components. After resolve, it contains both non-null and null elements (introduced by dot-snapshot dirs).
When getINodes is called, an array is returned excluding all null elements, which is the correct behavior. Meanwhile, the inodes array is trimmed too, which shouldn't be done by a getter. Because of the above, the behavior of getINodesInPath depends on whether getINodes has been called, which is not correct. The name of getLastINodeInPath is confusing – it actually returns the last non-null inode in the path. Also, shouldn't the return type be a single INode? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
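The core complaint in this issue is a getter with a side effect: the current {{getINodes}} both returns a filtered view and mutates the backing array. The "real getter" being discussed can be sketched as below; this is a simplified illustration (String stands in for INode, and the method name is illustrative), not the actual patch:

```java
// Hypothetical sketch of a side-effect-free getter: filter nulls into a
// fresh array and leave the backing array untouched, so repeated calls
// behave identically. String stands in for INode.
import java.util.Arrays;
import java.util.Objects;

public class INodesSketch {
    private final String[] inodes; // stand-in for the INode[] built by resolve()

    public INodesSketch(String[] inodes) {
        this.inodes = inodes;
    }

    /** Returns the non-null elements without modifying the internal array. */
    public String[] getNonNullINodes() {
        return Arrays.stream(inodes)
                     .filter(Objects::nonNull)
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // A path with a dot-snapshot component that resolved to null.
        INodesSketch path = new INodesSketch(new String[]{"dir", null, "file"});
        System.out.println(Arrays.toString(path.getNonNullINodes()));
        // A second call returns the same result: no hidden state change.
        System.out.println(Arrays.toString(path.getNonNullINodes()));
    }
}
```

Because nothing is trimmed in place, callers that need the raw, possibly-null-containing array (such as the snapshot checks in {{FSDirectory#isDir()}}) are unaffected by whether the getter has already been called.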
[jira] [Updated] (HDFS-7136) libhdfs fails compile step on OS X
[ https://issues.apache.org/jira/browse/HDFS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-7136: --- Fix Version/s: (was: HDFS-6534) libhdfs fails compile step on OS X -- Key: HDFS-7136 URL: https://issues.apache.org/jira/browse/HDFS-7136 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer Attachments: HDFS-7136.patch vecsum uses clock_gettime which isn't supported on OS X. Like Windows, we just need to ignore that bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7136) libhdfs fails compile step on OS X
[ https://issues.apache.org/jira/browse/HDFS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved HDFS-7136. Resolution: Duplicate Fix Version/s: HDFS-6534 HDFS-6534 is a better fix. Closing as a dupe. libhdfs fails compile step on OS X -- Key: HDFS-7136 URL: https://issues.apache.org/jira/browse/HDFS-7136 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer Fix For: HDFS-6534 Attachments: HDFS-7136.patch vecsum uses clock_gettime which isn't supported on OS X. Like Windows, we just need to ignore that bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)