[jira] [Commented] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442992#comment-13442992 ] Ravi Prakash commented on HDFS-1490: +1 lgtm TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when primary crashes during image transfer secondary namenode would hang trying to read the image from HTTP connection forever. It would be great to set timeouts on the connection so if something like that happens there is no need to restart the secondary itself. In our case restarting components is handled by the set of scripts and since the Secondary as the process is running it would just stay hung until we get an alarm saying the checkpointing doesn't happen. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
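The fix the issue asks for can be sketched as follows. This is a minimal illustration, not the attached patch: the class and method names (`TransferTimeout`, `open`) and the 60-second value are assumptions; the point is simply that both the connect and the read phase of the HTTP transfer get bounded, so a crashed primary cannot hang the secondary forever.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper sketching the proposed fix: bound both phases of the
// image transfer so a dead peer produces a timeout instead of a hung thread.
public class TransferTimeout {
    static final int TIMEOUT_MS = 60 * 1000; // illustrative value, not from the patch

    static URLConnection open(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(TIMEOUT_MS); // fail fast if the peer is down
        conn.setReadTimeout(TIMEOUT_MS);    // fail if the stream stalls mid-transfer
        return conn;
    }
}
```

With these two calls in place, a stalled `getimage` download throws `SocketTimeoutException` instead of blocking indefinitely, so the checkpointing scripts see a failure they can act on.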
[jira] [Commented] (HDFS-3540) Further improvement on recovery mode and edit log toleration in branch-1
[ https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443002#comment-13443002 ] Tsz Wo (Nicholas), SZE commented on HDFS-3540: -- If the edit log is not corrupted, neither recovery mode nor edit log toleration is useful. Note that recovery mode here means recovery mode in branch-1 but not the one in trunk. When an edit log is corrupted, NN cannot start up normally. We compare recovery mode and edit log toleration below. *Recovery Mode* - Recovery here means starting NN with a corrupted edit log. It is unable to recover the corrupted edit log or transactions. - There is a namenode command option hadoop namenode -recover to enter recovery mode. - When reading the first corrupted transaction in the edit log, it prompts the admin to either stop reading or quit without saving. - If stop reading is selected, NN ignores the remaining edit log (from the first corrupted transaction to the end of the edit log) and then starts up as usual. - There is a -force option to FORCE_FIRST_CHOICE, i.e. it is a non-interactive mode. - If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss. Recovery Mode does not help. (Please help out if I have missed anything.) *Edit Log Toleration* - It has a conf property dfs.namenode.edits.toleration.length for setting the toleration length. - The default toleration length is -1, i.e. it is disabled. The feature is enabled when the value >= 0. 
- When the feature is enabled, it always reads the entire edit log, computes read length, corruption length and padding length, and shows the following summary
{noformat}
2012-08-27 22:04:38,625 INFO - Checked the bytes after the end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
2012-08-27 22:04:38,625 INFO - Padding position = 876 (-1 means padding not found)
2012-08-27 22:04:38,625 INFO - Edit log length = 1065
2012-08-27 22:04:38,625 INFO - Read length = 168
2012-08-27 22:04:38,625 INFO - Corruption length = 708
2012-08-27 22:04:38,625 INFO - Toleration length = 1024 (= dfs.namenode.edits.toleration.length)
2012-08-27 22:04:38,626 INFO - Summary: |-- Read=168 --|-- Corrupt=708 --|-- Pad=189 --|
2012-08-27 22:04:38,626 WARN - Edit log corruption detected: corruption length = 708 <= toleration length = 1024; the corruption is tolerable.
{noformat}
- When toleration length is set to >= 0, it makes sure that there is no corruption in the entire log, including the padding. A stray OP_INVALID byte won't be misinterpreted as an end-of-log.
- When toleration length is set to >= 0, NN starts up only if corruption length <= toleration length. If corruption length > toleration length, it throws an exception as below
{noformat}
2012-08-27 22:04:39,123 INFO - Start checking end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits) ... 
2012-08-27 22:04:39,123 DEBUG - found: bytes[0]=0xFF=pad, firstPadPos=169
2012-08-27 22:04:39,123 DEBUG - reset: bytes[1410]=0xAB, pad=0xFF
2012-08-27 22:04:39,124 DEBUG - found: bytes[1411]=0xFF=pad, firstPadPos=1580
2012-08-27 22:04:39,124 INFO - Checked the bytes after the end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
2012-08-27 22:04:39,124 INFO - Padding position = 1580 (-1 means padding not found)
2012-08-27 22:04:39,124 INFO - Edit log length = 2638
2012-08-27 22:04:39,124 INFO - Read length = 169
2012-08-27 22:04:39,124 INFO - Corruption length = 1411
2012-08-27 22:04:39,124 INFO - Toleration length = 1024 (= dfs.namenode.edits.toleration.length)
2012-08-27 22:04:39,125 INFO - Summary: |-- Read=169 --|-- Corrupt=1411 --|-- Pad=1058 --|
2012-08-27 22:04:39,125 ERROR - FSNamesystem initialization failed. java.io.IOException: Edit log corruption detected: corruption length = 1411 > toleration length = 1024; the corruption is intolerable. at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkEndOfLog(FSEditLog.java:609) ...
{noformat}
- Therefore, the recommended setting is to set the conf to 0 (or a small number). When corruption is detected (i.e. NN cannot start up), the corruption length can be read from the log. Then, the admin can decide whether to tolerate the corruption or try to recover the edit log by other means. Further improvement on recovery mode and edit log toleration in branch-1 Key: HDFS-3540 URL: https://issues.apache.org/jira/browse/HDFS-3540 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.2.0 Reporter: Tsz Wo (Nicholas), SZE Assignee:
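The toleration knob discussed above can be sketched as an hdfs-site.xml entry. The property name comes from the comment; the value 1024, matching the logs above, is only illustrative (the comment's actual recommendation is 0 or a small number):

```xml
<!-- Tolerate up to 1024 bytes of trailing edit-log corruption; -1 (the default) disables the check. -->
<property>
  <name>dfs.namenode.edits.toleration.length</name>
  <value>1024</value>
</property>
```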
[jira] [Updated] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1490: -- Status: Patch Available (was: Open) TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when primary crashes during image transfer secondary namenode would hang trying to read the image from HTTP connection forever. It would be great to set timeouts on the connection so if something like that happens there is no need to restart the secondary itself. In our case restarting components is handled by the set of scripts and since the Secondary as the process is running it would just stay hung until we get an alarm saying the checkpointing doesn't happen. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443034#comment-13443034 ] Hadoop QA commented on HDFS-1490: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542723/HDFS-1490.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3105//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3105//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3105//console This message is automatically generated. TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when primary crashes during image transfer secondary namenode would hang trying to read the image from HTTP connection forever. It would be great to set timeouts on the connection so if something like that happens there is no need to restart the secondary itself. 
In our case restarting components is handled by the set of scripts and since the Secondary as the process is running it would just stay hung until we get an alarm saying the checkpointing doesn't happen. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443051#comment-13443051 ] Vinay commented on HDFS-1490: - {code} Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock){code} The findbugs warning is unrelated to the current patch. {code}Failed tests: testHdfsDelegationToken(org.apache.hadoop.hdfs.TestHftpDelegationToken): wrong tokens in user expected:<2> but was:<1>{code} The test failure is also unrelated to the current patch. TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when primary crashes during image transfer secondary namenode would hang trying to read the image from HTTP connection forever. It would be great to set timeouts on the connection so if something like that happens there is no need to restart the secondary itself. In our case restarting components is handled by the set of scripts and since the Secondary as the process is running it would just stay hung until we get an alarm saying the checkpointing doesn't happen. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HDFS-3847) using NFS As a shared storage for NameNode HA , how to ensure that only one write
[ https://issues.apache.org/jira/browse/HDFS-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned HDFS-3847: --- Assignee: (was: Devaraj K) using NFS As a shared storage for NameNode HA , how to ensure that only one write - Key: HDFS-3847 URL: https://issues.apache.org/jira/browse/HDFS-3847 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.0.0-alpha, 2.0.1-alpha Reporter: liaowenrui Priority: Critical Fix For: 2.0.0-alpha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HDFS-3847) using NFS As a shared storage for NameNode HA , how to ensure that only one write
[ https://issues.apache.org/jira/browse/HDFS-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned HDFS-3847: --- Assignee: Devaraj K using NFS As a shared storage for NameNode HA , how to ensure that only one write - Key: HDFS-3847 URL: https://issues.apache.org/jira/browse/HDFS-3847 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.0.0-alpha, 2.0.1-alpha Reporter: liaowenrui Assignee: Devaraj K Priority: Critical Fix For: 2.0.0-alpha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443056#comment-13443056 ] Suresh Srinivas commented on HDFS-3860: --- Jing, nice find. Submitting the patch. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the monitor thread will acquire the write lock of namesystem, and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without release the write lock. This may cause the monitor thread wrongly holding the write lock forever. The attached test case tries to simulate this bad scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
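The bug described above (an early return under a held write lock) has a standard remedy: every exit path must go through a finally block that releases the lock. This is a minimal sketch of that pattern, not the Hadoop code; the class name `HeartbeatCheckSketch` and the boolean flag standing in for the safemode check are assumptions.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of the locking pattern the fix needs: the safemode early
// return now passes through finally, so the write lock is never leaked.
public class HeartbeatCheckSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final boolean inSafeMode; // stands in for the namesystem's safemode check

    HeartbeatCheckSketch(boolean inSafeMode) { this.inSafeMode = inSafeMode; }

    void heartbeatCheck() {
        lock.writeLock().lock();
        try {
            if (inSafeMode) {
                return; // early return is safe: finally still runs
            }
            // ... remove the dead datanode under the write lock ...
        } finally {
            lock.writeLock().unlock(); // released on every exit path
        }
    }

    boolean writeLocked() { return lock.isWriteLocked(); }
}
```

In the buggy version, the `unlock()` sits after the safemode check on the normal path only, so the `return` inside the safemode branch leaves the monitor thread holding the namesystem write lock forever.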
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443058#comment-13443058 ] Suresh Srinivas commented on HDFS-3860: --- BTW could you please also ensure that this pattern of code is not repeated in any other places. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the monitor thread will acquire the write lock of namesystem, and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without release the write lock. This may cause the monitor thread wrongly holding the write lock forever. The attached test case tries to simulate this bad scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas updated HDFS-3860: -- Status: Patch Available (was: Open) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the monitor thread will acquire the write lock of namesystem, and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without release the write lock. This may cause the monitor thread wrongly holding the write lock forever. The attached test case tries to simulate this bad scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443075#comment-13443075 ] Suresh Srinivas commented on HDFS-3791: --- Uma sorry for the delay in reviewing this. +1 for the patch. Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes Key: HDFS-3791 URL: https://issues.apache.org/jira/browse/HDFS-3791 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.0.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Attachments: HDFS-3791.patch, HDFS-3791.patch Backport HDFS-173. see the [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007] for more details -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas updated HDFS-3791: -- Attachment: HDFS-3791.patch Rebased the patch on latest branch-1 Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes Key: HDFS-3791 URL: https://issues.apache.org/jira/browse/HDFS-3791 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.0.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch Backport HDFS-173. see the [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007] for more details -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas resolved HDFS-3791. --- Resolution: Fixed Fix Version/s: 1.2.0 Hadoop Flags: Reviewed I committed the patch. Thank you Uma. Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes Key: HDFS-3791 URL: https://issues.apache.org/jira/browse/HDFS-3791 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.0.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Fix For: 1.2.0 Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch Backport HDFS-173. see the [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007] for more details -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443103#comment-13443103 ] Uma Maheswara Rao G commented on HDFS-3791: --- Oh, I have just seen the comments. {quote} Uma sorry for the delay in reviewing this. +1 for the patch. {quote} No problem :-). Thanks a lot, Suresh, for the reviews. Also thanks for rebasing it. I will try to get a patch for HDFS-2815 today in some time. Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes Key: HDFS-3791 URL: https://issues.apache.org/jira/browse/HDFS-3791 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.0.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Fix For: 1.2.0 Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch Backport HDFS-173. see the [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007] for more details -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443105#comment-13443105 ] Hadoop QA commented on HDFS-3860: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542695/HDFS-3860.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3106//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3106//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3106//console This message is automatically generated. 
HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the monitor thread will acquire the write lock of namesystem, and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without release the write lock. This may cause the monitor thread wrongly holding the write lock forever. The attached test case tries to simulate this bad scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443113#comment-13443113 ] Suresh Srinivas commented on HDFS-3837: --- It seems to me the findbugs warning is not fixed by the new patch, or else it is a Jenkins error. Fixing this issue quickly will help: currently all Jenkins reports have a findbugs -1 for precommit tests. {noformat}
Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
Bug type EC_UNRELATED_TYPES (click for details)
In class org.apache.hadoop.hdfs.server.datanode.DataNode
In method org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
Actual type org.apache.hadoop.hdfs.protocol.DatanodeInfo
Expected org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration
Value loaded from id
Value loaded from bpReg
org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration.equals(Object) used to determine equality
At DataNode.java:[line 1869]
{noformat} Fix DataNode.recoverBlock findbugs warning -- Key: HDFS-3837 URL: https://issues.apache.org/jira/browse/HDFS-3837 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Eli Collins Attachments: hdfs-3837.txt, hdfs-3837.txt HDFS-2686 introduced the following findbugs warning: {noformat}
Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
{noformat} Both are using DatanodeID#equals but it's a different method because DNR#equals overrides equals for some reason (doesn't change behavior). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
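To see why findbugs flags EC_UNRELATED_TYPES as a bug rather than a style issue, consider this toy model (the classes below are stand-ins, not the actual Hadoop types): when `equals()` is called across two unrelated classes, a well-behaved implementation returns false for any object of a different class, so the comparison can never succeed and the check is dead code.

```java
// Illustrative stand-ins for DatanodeInfo and DatanodeRegistration;
// the names are hypothetical, chosen only to mirror the warning.
public class UnrelatedEquals {
    static class DatanodeInfoLike {
        final String id;
        DatanodeInfoLike(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof DatanodeInfoLike)) return false; // different class => false
            return id.equals(((DatanodeInfoLike) o).id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }
    static class RegistrationLike {
        final String id;
        RegistrationLike(String id) { this.id = id; }
        // inherits Object.equals (identity), so it never equals a DatanodeInfoLike
    }
    static boolean compare(RegistrationLike reg, DatanodeInfoLike info) {
        return reg.equals(info); // always false: the types are unrelated
    }
}
```

Even though both objects carry the same `id`, the cross-type comparison is always false, which is exactly the silent no-op findbugs is warning about.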
[jira] [Commented] (HDFS-3856) TestHDFSServerPorts failure is causing surefire fork failure
[ https://issues.apache.org/jira/browse/HDFS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443123#comment-13443123 ] Hudson commented on HDFS-3856: -- Integrated in Hadoop-Hdfs-trunk #1148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1148/]) Fixup CHANGELOG for HDFS-3856. (Revision 1377936) HDFS-3856. TestHDFSServerPorts failure is causing surefire fork failure. Contributed by Colin Patrick McCabe (Revision 1377934) Result = FAILURE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1377936 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1377934 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java TestHDFSServerPorts failure is causing surefire fork failure Key: HDFS-3856 URL: https://issues.apache.org/jira/browse/HDFS-3856 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.2.0-alpha Reporter: Thomas Graves Assignee: Eli Collins Priority: Blocker Fix For: 2.2.0-alpha Attachments: hdfs-3856.txt, hdfs-3856.txt We have been seeing the hdfs tests on trunk and branch-2 error out with fork failures. I see the hadoop jenkins trunk build is also seeing these: https://builds.apache.org/view/Hadoop/job/Hadoop-trunk/lastCompletedBuild/console -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3856) TestHDFSServerPorts failure is causing surefire fork failure
[ https://issues.apache.org/jira/browse/HDFS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443151#comment-13443151 ] Hudson commented on HDFS-3856: -- Integrated in Hadoop-Mapreduce-trunk #1179 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1179/]) Fixup CHANGELOG for HDFS-3856. (Revision 1377936) HDFS-3856. TestHDFSServerPorts failure is causing surefire fork failure. Contributed by Colin Patrick McCabe (Revision 1377934) Result = FAILURE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1377936 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1377934 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java TestHDFSServerPorts failure is causing surefire fork failure Key: HDFS-3856 URL: https://issues.apache.org/jira/browse/HDFS-3856 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.2.0-alpha Reporter: Thomas Graves Assignee: Eli Collins Priority: Blocker Fix For: 2.2.0-alpha Attachments: hdfs-3856.txt, hdfs-3856.txt We have been seeing the hdfs tests on trunk and branch-2 error out with fork failures. I see the hadoop jenkins trunk build is also seeing these: https://builds.apache.org/view/Hadoop/job/Hadoop-trunk/lastCompletedBuild/console -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225
[ https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated HDFS-3852: -- Attachment: HDFS-3852.patch The test is attempting to insert two tokens with the same service. The UGI's private creds is a list, which happily accepted tokens with duplicate services and even duplicate tokens. When I changed UGI in HADOOP-8225 to allow extraction of a {{Credentials}} object from the UGI, it broke the test because {{Credentials}} uses a map for tokens, which naturally doesn't allow for service dups. The test is really trying to ensure the correct token is retrieved for hftp, so I changed the 2nd token to have a different service to prevent it replacing the first token. Arguably, multiple tokens for the same service with different kinds should be permissible. However, in practice that is/was not possible because a {{Credentials}} (which doesn't allow service dups) is used to build up tokens to be dumped into the UGI. TestHftpDelegationToken is broken after HADOOP-8225 --- Key: HDFS-3852 URL: https://issues.apache.org/jira/browse/HDFS-3852 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, security Affects Versions: 0.23.3, 2.1.0-alpha Reporter: Aaron T. Myers Assignee: Daryn Sharp Attachments: HDFS-3852.patch It's been failing in all builds for the last 2 days or so. Git bisect indicates that it's due to HADOOP-8225. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
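The list-versus-map distinction behind the breakage can be shown with a toy model (not the Hadoop `Credentials`/UGI classes; the helper name and the string-pair token representation are assumptions): a list keeps both tokens for the same service, while a map keyed by service silently collapses them to one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of why the test broke: Credentials keys tokens by service,
// so a second token with the same service replaces the first.
public class TokenDedup {
    // Each token is {service, tokenIdentifier}; returns how many survive
    // a map keyed by service, as in the Credentials-style storage.
    static int distinctByService(String[][] tokens) {
        Map<String, String> byService = new HashMap<>();
        for (String[] t : tokens) {
            byService.put(t[0], t[1]); // later token replaces earlier on same service
        }
        return byService.size();
    }
}
```

This is why the patch gives the test's second token a different service: with distinct services both tokens survive, and the test can still verify that the correct one is retrieved.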
[jira] [Updated] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225
[ https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated HDFS-3852: -- Status: Patch Available (was: Open) TestHftpDelegationToken is broken after HADOOP-8225 --- Key: HDFS-3852 URL: https://issues.apache.org/jira/browse/HDFS-3852 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, security Affects Versions: 0.23.3, 2.1.0-alpha Reporter: Aaron T. Myers Assignee: Daryn Sharp Attachments: HDFS-3852.patch It's been failing in all builds for the last 2 days or so. Git bisect indicates that it's due to HADOOP-8225. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225
[ https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443212#comment-13443212 ] Aaron T. Myers commented on HDFS-3852: -- Got it. Makes sense. Thanks for the explanation, Daryn, and thanks for looking into this issue. The patch looks good to me. +1 pending Jenkins.
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443221#comment-13443221 ] Robert Joseph Evans commented on HDFS-3731: --- Any update on branch-0.23? Do you want me to look into it? 2.0 release upgrade must handle blocks being written from 1.0 - Key: HDFS-3731 URL: https://issues.apache.org/jira/browse/HDFS-3731 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Suresh Srinivas Assignee: Colin Patrick McCabe Priority: Blocker Fix For: 2.2.0-alpha Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch Release 2.0 upgrades must handle blocks being written to (bbw) files from the 1.0 release. Problem reported by Brahma Reddy. The {{DataNode}} will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases there were no block pools; equivalently, everything was in the same block pool.) During the upgrade, we should hardlink the block files from the {{blocksBeingWritten}} directory into the {{rbw}} directory of this block pool. Similarly, on {{-finalize}}, we should delete the {{blocksBeingWritten}} directory.
[jira] [Updated] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-3837: -- Attachment: hdfs-3837.txt The findbugs warning seems bogus: "This method calls equals(Object) on two references of different class types with no common subclasses. Therefore, the objects being compared are unlikely to be members of the same class at runtime." Both DatanodeInfo and DatanodeRegistration extend DatanodeID, so they share the same equals implementation. Anyway, I'll put the relevant code back (cast the array) since it fixes the findbugs warning and is fine (just more verbose).
{code}
-DatanodeID[] datanodeids = rBlock.getLocations();
+DatanodeInfo[] targets = rBlock.getLocations();
+DatanodeID[] datanodeids = (DatanodeID[])targets;
{code}
Updated patch; it includes the comments as well so it's clear both classes are using the same equals method. Fix DataNode.recoverBlock findbugs warning -- Key: HDFS-3837 URL: https://issues.apache.org/jira/browse/HDFS-3837 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Eli Collins Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt HDFS-2686 introduced the following findbugs warning:
{noformat}
Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
{noformat}
Both are using DatanodeID#equals, but findbugs sees it as a different method because DNR#equals overrides equals for some reason (it doesn't change behavior).
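Why the warning is spurious can be shown with a toy hierarchy (hypothetical stand-in classes, not the real DatanodeID types): two siblings that inherit the same equals() from a common base compare correctly, even though a static checker sees two unrelated declared types.

```java
import java.util.Objects;

public class SharedEqualsSketch {
    // Stand-in for DatanodeID: the base class that actually defines equals().
    static class BaseId {
        final String id;
        BaseId(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BaseId && ((BaseId) o).id.equals(this.id);
        }
        @Override public int hashCode() { return Objects.hash(id); }
    }

    // Stand-ins for DatanodeInfo and DatanodeRegistration: siblings that
    // both inherit the base equals(), so comparing one against the other
    // is well-defined even though findbugs sees two unrelated types.
    static class Info extends BaseId { Info(String id) { super(id); } }
    static class Registration extends BaseId { Registration(String id) { super(id); } }

    // Upcasting to the base type, as the patch does with the array cast,
    // makes the shared equals() visible to the reader (and to findbugs).
    static boolean sameNode(BaseId a, BaseId b) { return a.equals(b); }
}
```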
[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443245#comment-13443245 ] Ted Yu commented on HDFS-3791: -- Currently a small deletion is determined by the constant BLOCK_DELETION_INCREMENT:
{code}
+ deleteNow = collectedBlocks.size() >= BLOCK_DELETION_INCREMENT;
{code}
I wonder if there is a use case where the increment should be configurable. Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes Key: HDFS-3791 URL: https://issues.apache.org/jira/browse/HDFS-3791 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.0.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Fix For: 1.2.0 Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch Backport HDFS-173. See the [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007] for more details
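The incremental-deletion idea behind BLOCK_DELETION_INCREMENT, delete a batch, release the namesystem write lock, reacquire it for the next batch, can be sketched roughly as follows. Class and method names here are hypothetical; only the constant name comes from the patch, and its value is arbitrary, as noted below.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class IncrementalDeleteSketch {
    // Arbitrary batch size, as the discussion notes; hypothetical value.
    static final int BLOCK_DELETION_INCREMENT = 1000;

    final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();

    // Deletes the collected blocks in fixed-size batches, dropping the
    // write lock between batches so queued operations can make progress.
    // Returns the number of batches for illustration.
    int removeBlocks(List<String> collectedBlocks) {
        int batches = 0;
        Iterator<String> it = collectedBlocks.iterator();
        while (it.hasNext()) {
            namesystemLock.writeLock().lock();
            try {
                for (int i = 0; i < BLOCK_DELETION_INCREMENT && it.hasNext(); i++) {
                    it.next();
                    it.remove(); // stand-in for the real block removal
                }
                batches++;
            } finally {
                namesystemLock.writeLock().unlock(); // give up the lock per increment
            }
        }
        return batches;
    }
}
```

Making the increment configurable would just mean reading the batch size from a conf property instead of the constant.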
[jira] [Commented] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225
[ https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443264#comment-13443264 ] Hadoop QA commented on HDFS-3852: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542779/HDFS-3852.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3107//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3107//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3107//console This message is automatically generated.
[jira] [Created] (HDFS-3861) Deadlock in DFSClient
Kihwal Lee created HDFS-3861: Summary: Deadlock in DFSClient Key: HDFS-3861 URL: https://issues.apache.org/jira/browse/HDFS-3861 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha Reporter: Kihwal Lee Priority: Blocker Fix For: 0.23.4, 3.0.0, 2.2.0-alpha The deadlock is between DFSOutputStream#close() and DFSClient#close().
[jira] [Commented] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443269#comment-13443269 ] Kihwal Lee commented on HDFS-3861: -- DFSClient#getLeaseRenewer() doesn't have to be synchronized since LeaseManager.Factory methods are synchronized. Multiple callers are still guaranteed to get a single live renewer back.
{noformat}
Java stack information for the threads listed above:
===
Thread-28:
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1729)
 - waiting to lock 0xb5a05dc8 (a org.apache.hadoop.hdfs.DFSOutputStream)
 at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:674)
 at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:691)
 - locked 0xb5a06ed8 (a org.apache.hadoop.hdfs.DFSClient)
 at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:539)
 at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2386)
 - locked 0xb44b00e8 (a org.apache.hadoop.fs.FileSystem$Cache)
 at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2403)
 - locked 0xb44b0100 (a org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Thread-1175:
 at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:538)
 - waiting to lock 0xb5a06ed8 (a org.apache.hadoop.hdfs.DFSClient)
 at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:550)
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1757)
 - locked 0xb5a05dc8 (a org.apache.hadoop.hdfs.DFSOutputStream)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:66)
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:99)
 at org.apache.hadoop.hdfs.TestDatanodeDeath$Workload.run(TestDatanodeDeath.java:101)
{noformat}
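The two stacks above form a classic lock-ordering deadlock: one thread locks the client then the stream, the other locks the stream then the client. A minimal model with hypothetical names (`clientLock` and `streamLock` stand in for the two monitors; the real fix is dropping the synchronization from the renewer lookup, as the comment describes):

```java
public class LockOrderSketch {
    static final Object clientLock = new Object(); // stands for the DFSClient monitor
    static final Object streamLock = new Object(); // stands for the DFSOutputStream monitor

    // Path seen in Thread-28: DFSClient#close locks the client, then
    // closeAllFilesBeingWritten tries to lock each open stream.
    static void clientClose() {
        synchronized (clientLock) {
            synchronized (streamLock) {
                // close the stream
            }
        }
    }

    // Path seen in Thread-1175 before the fix: DFSOutputStream#close locks
    // the stream, then endFileLease calls a synchronized getLeaseRenewer(),
    // taking the client lock second. Opposite order: deadlock.
    static void streamCloseBeforeFix() {
        synchronized (streamLock) {
            synchronized (clientLock) {
                // look up the lease renewer
            }
        }
    }

    // Path after the fix: the renewer lookup no longer takes the client
    // lock, because the factory it delegates to is itself synchronized.
    static void streamCloseAfterFix() {
        synchronized (streamLock) {
            // getLeaseRenewer() without touching clientLock
        }
    }

    // Helper: run two paths concurrently and report whether both finished.
    static boolean runsToCompletion(Runnable a, Runnable b) {
        Thread t1 = new Thread(a);
        Thread t2 = new Thread(b);
        t1.start();
        t2.start();
        try {
            t1.join(2000);
            t2.join(2000);
        } catch (InterruptedException e) {
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }
}
```

With the fix, both close paths acquire the locks in a single consistent order, so they can no longer wait on each other.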
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443271#comment-13443271 ] Aaron T. Myers commented on HDFS-3860: -- Oof, good catch, Jing. Fortunately this case seems like it would be pretty tough to hit: if the NN is already in safemode then HeartbeatManager#heartbeatCheck will return early, so to hit this the NN would have to enter safemode in a very short window of time. Still certainly worth fixing, though. The patch looks good to me. The findbugs warning is unrelated and TestHftpDelegationToken is known to currently be failing. +1, I'll commit this momentarily. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the monitor thread will acquire the write lock of the namesystem and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without releasing the write lock. This may cause the monitor thread to wrongly hold the write lock forever. The attached test case tries to simulate this bad scenario.
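The bug pattern in the description, an early return out of a locked region without a finally block, can be sketched in isolation (hypothetical class; the real code guards the namesystem write lock):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockLeakSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    boolean inSafeMode = true;

    // Buggy shape from the description: the early return on safemode
    // skips the unlock, so the monitor thread keeps the write lock forever.
    void heartbeatCheckBuggy() {
        lock.writeLock().lock();
        if (inSafeMode) {
            return; // write lock never released
        }
        // ... remove the dead datanode ...
        lock.writeLock().unlock();
    }

    // Fixed shape: try/finally guarantees the unlock on every exit path.
    void heartbeatCheckFixed() {
        lock.writeLock().lock();
        try {
            if (inSafeMode) {
                return;
            }
            // ... remove the dead datanode ...
        } finally {
            lock.writeLock().unlock();
        }
    }

    int writeHoldCount() {
        return lock.getWriteHoldCount();
    }
}
```

This is also the pattern Suresh asks to audit for elsewhere: any lock acquisition whose critical section can return or throw should release in a finally block.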
[jira] [Updated] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee updated HDFS-3861: - Attachment: hdfs-3861.patch.txt
[jira] [Updated] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee updated HDFS-3861: - Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3860: - Resolution: Fixed Fix Version/s: 2.2.0-alpha Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've just committed this to trunk and branch-2. Thanks a lot for the contribution, Jing.
[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443286#comment-13443286 ] Suresh Srinivas commented on HDFS-3837: --- If this is a findbugs issue, why not just add it to the findbugs exclude list?
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443289#comment-13443289 ] Suresh Srinivas commented on HDFS-3860: --- Thanks Aaron for committing the patch. bq. BTW could you please also ensure that this pattern of code is not repeated in any other places. Going back to my previous comment, Jing, if possible can you also see if there are other such issues.
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443292#comment-13443292 ] Jing Zhao commented on HDFS-3860: - I just checked all the invocations of namesystem#writelock / writeunlock and did not find similar problems. I will check other similar code too.
[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes
[ https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443296#comment-13443296 ] Suresh Srinivas commented on HDFS-3791: --- When I added this in trunk, I was not sure if there was a use case. The whole idea was to give up the lock after deleting some number of blocks, so the current number is arbitrary.
[jira] [Assigned] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee reassigned HDFS-3861: Assignee: Kihwal Lee
[jira] [Updated] (HDFS-2815) Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed.
[ https://issues.apache.org/jira/browse/HDFS-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-2815: -- Attachment: HDFS-2815-branch-1.patch Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed. -- Key: HDFS-2815 URL: https://issues.apache.org/jira/browse/HDFS-2815 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.22.0, 0.24.0, 0.23.1, 1.0.0, 1.1.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Priority: Critical Fix For: 2.0.0-alpha, 3.0.0 Attachments: HDFS-2815-22-branch.patch, HDFS-2815-branch-1.patch, HDFS-2815-Branch-1.patch, HDFS-2815.patch, HDFS-2815.patch When testing HA (internal) with continuous switches at roughly 5-minute intervals, we found some *blocks missed* and the namenode went into safemode after the next switch. After analysis, I found that these files had already been deleted by clients, but I don't see any delete command logs in the namenode log files. The namenode nevertheless added those blocks to invalidateSets and the DNs deleted the blocks. On restart, the namenode went into safemode, expecting more blocks before it could come out of safemode. The reason could be that the file was deleted in memory and its blocks added into invalidates before the edits were synced to the editlog file; by that time the NN had already asked the DNs to delete those blocks. The namenode then shut down before persisting to the editlogs (log behind). Due to this, we may not get the INFO logs about the delete, and when we restart the namenode (in my scenario it is again a switch), it expects these deleted blocks as well, since the delete request was never persisted into the editlog. I reproduced this scenario with debug points. *I feel we should not add the blocks to invalidates before persisting into the editlog.* Note: for the switch, we used kill -9 (force kill). I am currently on the 0.20.2 version. The same was verified in 0.23 as well in a normal crash + restart scenario.
[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches
[ https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John George updated HDFS-3373: -- Status: Open (was: Patch Available) FileContext HDFS implementation can leak socket caches -- Key: HDFS-3373 URL: https://issues.apache.org/jira/browse/HDFS-3373 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Todd Lipcon Assignee: John George Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, and thus never calls DFSClient.close(). This means that, until finalizers run, DFSClient will hold on to its SocketCache object and potentially have a lot of outstanding sockets/fds held on to. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches
[ https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John George updated HDFS-3373: -- Attachment: HDFS-3373.trunk.patch.1 The TestConnCache failure is related to this JIRA. I had moved testDisableCache() from that test to another test file because it is no longer possible to change the cache config per DFS. TestHftpDelegationToken is unrelated to this patch and has been failing in other builds as well. Attaching a patch with testDisableCache() moved from TestConnCache to a new file.
[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches
[ https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John George updated HDFS-3373: -- Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: recovery-mode.pdf Here is an updated Recovery Mode design document. Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.0.0-alpha Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.043.patch, HDFS-3004__namenode_recovery_tool.txt, recovery-mode.pdf When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. 
When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443338#comment-13443338 ] Hudson commented on HDFS-3860: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2680 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2680/]) HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem. Contributed by Jing Zhao. (Revision 1378228) Result = FAILURE atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378228 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
[jira] [Commented] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443351#comment-13443351 ] Hadoop QA commented on HDFS-3861: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542787/hdfs-3861.patch.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3109//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3109//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3109//console This message is automatically generated. 
Deadlock in DFSClient - Key: HDFS-3861 URL: https://issues.apache.org/jira/browse/HDFS-3861 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha Reporter: Kihwal Lee Assignee: Kihwal Lee Priority: Blocker Fix For: 0.23.4, 3.0.0, 2.2.0-alpha Attachments: hdfs-3861.patch.txt The deadlock is between DFSOutputStream#close() and DFSClient#close(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443353#comment-13443353 ] Hudson commented on HDFS-3860: -- Integrated in Hadoop-Common-trunk-Commit #2651 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2651/]) HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem. Contributed by Jing Zhao. (Revision 1378228) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378228 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443367#comment-13443367 ] Hudson commented on HDFS-3860: -- Integrated in Hadoop-Hdfs-trunk-Commit #2715 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2715/]) HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem. Contributed by Jing Zhao. (Revision 1378228) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378228 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3540) Further improvement on recovery mode and edit log toleration in branch-1
[ https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443375#comment-13443375 ] Colin Patrick McCabe commented on HDFS-3540: Hi Nicholas, Your summary seems reasonable to me overall. I agree with you that the recommended setting for edit log toleration should be disabled. Is there anything left to do for this JIRA? Further improvement on recovery mode and edit log toleration in branch-1 Key: HDFS-3540 URL: https://issues.apache.org/jira/browse/HDFS-3540 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 1.2.0 Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1. However, the recovery mode feature in branch-1 is dramatically different from the recovery mode in trunk since the edit log implementations in these two branches are different. For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not in trunk. *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy UNCHECKED_REGION_LENGTH and to tolerate edit log corruption. There are overlaps between these two features. We study potential further improvements in this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443377#comment-13443377 ] Colin Patrick McCabe commented on HDFS-3731: bq. Any update on branch-0.23? Do you want me to look into it? There are some differences in the branch-0.23 BlockManager state machine, such that a straight port of the patch doesn't work. The easiest thing to do would probably be to backport some of the BlockManager fixes and improvements to branch-0.23. If you would look into that it would be good. 2.0 release upgrade must handle blocks being written from 1.0 - Key: HDFS-3731 URL: https://issues.apache.org/jira/browse/HDFS-3731 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Suresh Srinivas Assignee: Colin Patrick McCabe Priority: Blocker Fix For: 2.2.0-alpha Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by Brahma Reddy. The {{DataNode}} will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases, there were no block pools-- or equivalently, everything was in the same block pool). During the upgrade, we should hardlink the block files from the {{blocksBeingWritten}} directory into the {{rbw}} directory of this block pool. Similarly, on {{-finalize}}, we should delete the {{blocksBeingWritten}} directory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
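The upgrade step in the description — hard-linking block files from {{blocksBeingWritten}} into the block pool's {{rbw}} directory — can be sketched with plain JDK file APIs. This is a simplification: the real DataNode upgrade also handles metadata files and the full directory layout, and the names here just follow the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;

// Sketch of the bbw -> rbw upgrade step: hard link (not copy) each block
// file, so the upgrade is cheap and the -finalize step can simply delete
// the old blocksBeingWritten directory.
class BbwUpgradeSketch {
    static int linkBlocksBeingWritten(Path bbwDir, Path rbwDir) {
        try {
            Files.createDirectories(rbwDir);
            int linked = 0;
            try (DirectoryStream<Path> blocks = Files.newDirectoryStream(bbwDir)) {
                for (Path block : blocks) {
                    // Hard link: both names refer to the same on-disk data.
                    Files.createLink(rbwDir.resolve(block.getFileName()), block);
                    linked++;
                }
            }
            return linked;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo: one fake block file gets linked into rbw/.
    static int demo() {
        try {
            Path base = Files.createTempDirectory("bbw-upgrade");
            Path bbw = Files.createDirectories(base.resolve("blocksBeingWritten"));
            Files.write(bbw.resolve("blk_1"), new byte[] {1, 2, 3});
            return linkBlocksBeingWritten(bbw, base.resolve("rbw"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```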
[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443387#comment-13443387 ] Hadoop QA commented on HDFS-3837: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542780/hdfs-3837.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3108//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3108//console This message is automatically generated. 
Fix DataNode.recoverBlock findbugs warning -- Key: HDFS-3837 URL: https://issues.apache.org/jira/browse/HDFS-3837 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Eli Collins Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt HDFS-2686 introduced the following findbugs warning: {noformat} Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock) {noformat} Both are using DatanodeID#equals but it's a different method because DNR#equals overrides equals for some reason (doesn't change behavior). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2815) Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed.
[ https://issues.apache.org/jira/browse/HDFS-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443394#comment-13443394 ] Hadoop QA commented on HDFS-2815: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542794/HDFS-2815-branch-1.patch against trunk revision . -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3111//console This message is automatically generated. Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed. -- Key: HDFS-2815 URL: https://issues.apache.org/jira/browse/HDFS-2815 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.22.0, 0.24.0, 0.23.1, 1.0.0, 1.1.0 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Priority: Critical Fix For: 2.0.0-alpha, 3.0.0 Attachments: HDFS-2815-22-branch.patch, HDFS-2815-branch-1.patch, HDFS-2815-Branch-1.patch, HDFS-2815.patch, HDFS-2815.patch When we tested HA (internal) with continuous switches at about 5-minute intervals, we found some *blocks missed* and the namenode went into safemode after the next switch. After analysis, I found that these files had already been deleted by clients, but I don't see any delete command logs in the namenode log files. Yet the namenode added those blocks to invalidateSets and the DNs deleted the blocks. On restart, the namenode went into safemode, expecting more blocks to arrive before it could come out of safemode. The reason could be that the file is deleted in memory and added to invalidates, and only after this does the NN try to sync the edits into the editlog file. By that time the NN has already asked the DNs to delete those blocks.
If the namenode now shuts down before persisting to the editlogs (log behind), we may not get the INFO logs about the delete, and when we restart the namenode (in my scenario it is again a switch), the namenode also expects these deleted blocks, as the delete request was not persisted to the editlog beforehand. I reproduced this scenario with debug points. *I feel we should not add the blocks to invalidates before persisting to the editlog.* Note: for the switch, we used kill -9 (force kill). I am currently on version 0.20.2. The same was verified in 0.23 as well, in a normal crash + restart scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
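The proposed invariant — persist the delete to the editlog before queuing its blocks for invalidation — boils down to an ordering of two steps. A tiny sketch with invented stand-in structures (not the NameNode's actual data structures):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the proposed ordering: persist the delete before scheduling
// block invalidation, so a crash between the two steps can never leave
// the DataNodes ahead of the persisted log.
class DeleteOrderingSketch {
    final List<String> editLog = new ArrayList<>();           // stand-in for persisted edits
    final Deque<String> invalidateQueue = new ArrayDeque<>(); // blocks for DNs to delete

    void deleteFile(String path, List<String> blocks) {
        editLog.add("OP_DELETE " + path);  // 1. persist the delete first
        invalidateQueue.addAll(blocks);    // 2. only then schedule block removal
    }
}
```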
[jira] [Commented] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443401#comment-13443401 ] Kihwal Lee commented on HDFS-3861: -- - The test failures are not related to this patch. - No test was added. Existing test case exposed this bug (TestDataNodeDeath). - The findbugs warning is not caused by this patch. Deadlock in DFSClient - Key: HDFS-3861 URL: https://issues.apache.org/jira/browse/HDFS-3861 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha Reporter: Kihwal Lee Assignee: Kihwal Lee Priority: Blocker Fix For: 0.23.4, 3.0.0, 2.2.0-alpha Attachments: hdfs-3861.patch.txt The deadlock is between DFSOutputStream#close() and DFSClient#close(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3849: --- Attachment: HDFS-3849.003.patch * don't set DT config When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443445#comment-13443445 ] Aaron T. Myers commented on HDFS-3849: -- +1 pending Jenkins. When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3861) Deadlock in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443463#comment-13443463 ] Colin Patrick McCabe commented on HDFS-3861: Looks good to me. Deadlock in DFSClient - Key: HDFS-3861 URL: https://issues.apache.org/jira/browse/HDFS-3861 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha Reporter: Kihwal Lee Assignee: Kihwal Lee Priority: Blocker Fix For: 0.23.4, 3.0.0, 2.2.0-alpha Attachments: hdfs-3861.patch.txt The deadlock is between DFSOutputStream#close() and DFSClient#close(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3859) QJM: implement md5sum verification
[ https://issues.apache.org/jira/browse/HDFS-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443476#comment-13443476 ] Steve Loughran commented on HDFS-3859: -- Isn't MD5 overkill? Can't a good CRC (like TCP Jumbo Frames uses) suffice? QJM: implement md5sum verification -- Key: HDFS-3859 URL: https://issues.apache.org/jira/browse/HDFS-3859 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Assignee: Todd Lipcon When the QJM passes journal segments between nodes, it should use an md5sum field to make sure the data doesn't get corrupted during transit. This also serves as an extra safe-guard to make sure that the data is consistent across all nodes when finalizing a segment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3859) QJM: implement md5sum verification
[ https://issues.apache.org/jira/browse/HDFS-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443483#comment-13443483 ] Todd Lipcon commented on HDFS-3859: --- Sure, it's overkill, but it's not that expensive and we already have an implementation of it sitting around. It's also handy because md5sum is commonly available on the command line, and we use it for FSImages already as well. Performance-wise, my laptop can md5sum at about 500MB/sec, so given that log segments under recovery are likely to be much smaller than 500M, I don't think we should be concerned about that. QJM: implement md5sum verification -- Key: HDFS-3859 URL: https://issues.apache.org/jira/browse/HDFS-3859 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Assignee: Todd Lipcon When the QJM passes journal segments between nodes, it should use an md5sum field to make sure the data doesn't get corrupted during transit. This also serves as an extra safe-guard to make sure that the data is consistent across all nodes when finalizing a segment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
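For reference, the md5sum computation itself needs only the JDK. Hadoop carries its own MD5 utilities (used for fsimages, as Todd notes); this standalone sketch just shows the digest-and-compare idea.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// md5sum of a journal segment's bytes, matching the hex form that the
// md5sum command-line tool prints.
class SegmentChecksumSketch {
    static String md5Hex(byte[] segment) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(segment);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));  // two lowercase hex chars per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("every JRE must provide MD5", e);
        }
    }
}
```

Computing the digest on each side of a segment transfer and comparing the hex strings detects in-transit corruption.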
[jira] [Created] (HDFS-3862) QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics
Todd Lipcon created HDFS-3862: - Summary: QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics Key: HDFS-3862 URL: https://issues.apache.org/jira/browse/HDFS-3862 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Currently, NN HA requires that the administrator configure a fencing method to ensure that only a single NameNode may write to the shared storage at a time. Some shared edits storage implementations (like QJM) inherently enforce single-writer semantics at the storage level, and thus the user should not be forced to specify one. We should extend the JournalManager interface so that the HA code can operate without a configured fencer if the JM has such built-in fencing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3862) QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics
[ https://issues.apache.org/jira/browse/HDFS-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443518#comment-13443518 ] Todd Lipcon commented on HDFS-3862: --- I think this might be the case for BookKeeper as well. Any of the folks working on BKJM want to take this on? I anticipate we would add a simple API to JournalManager like: {{boolean isNativelySingleWriter();}} or {{boolean needsExternalFencing();}}. Then the failover code could check the shared storage dir to see if this is the case, and if so, not error out if the user doesn't specify a fence method. QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics --- Key: HDFS-3862 URL: https://issues.apache.org/jira/browse/HDFS-3862 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Currently, NN HA requires that the administrator configure a fencing method to ensure that only a single NameNode may write to the shared storage at a time. Some shared edits storage implementations (like QJM) inherently enforce single-writer semantics at the storage level, and thus the user should not be forced to specify one. We should extend the JournalManager interface so that the HA code can operate without a configured fencer if the JM has such built-in fencing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
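A rough shape of the proposed extension. The method name is taken from the suggestions in the comment; everything else is invented for illustration, and the real JournalManager interface and HA failover code are more involved.

```java
// Sketch of the proposed JournalManager extension.
interface JournalManagerSketch {
    // True if the shared storage needs an admin-configured fence method;
    // false if it enforces single-writer semantics itself (e.g. QJM).
    boolean needsExternalFencing();
}

class QuorumJournalSketch implements JournalManagerSketch {
    public boolean needsExternalFencing() { return false; } // built-in fencing
}

class FileJournalSketch implements JournalManagerSketch {
    public boolean needsExternalFencing() { return true; }  // e.g. an NFS shared dir
}

class FailoverCheckSketch {
    // The failover code would refuse to proceed only when external fencing
    // is required but no fence method was configured.
    static boolean fencerRequired(JournalManagerSketch jm, boolean fencerConfigured) {
        return jm.needsExternalFencing() && !fencerConfigured;
    }
}
```

Failover startup would then error out only when needsExternalFencing() is true and no fence method is configured.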
[jira] [Commented] (HDFS-3373) FileContext HDFS implementation can leak socket caches
[ https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443524#comment-13443524 ] Hadoop QA commented on HDFS-3373: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542795/HDFS-3373.trunk.patch.1 against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3110//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3110//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3110//console This message is automatically generated. FileContext HDFS implementation can leak socket caches -- Key: HDFS-3373 URL: https://issues.apache.org/jira/browse/HDFS-3373 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Todd Lipcon Assignee: John George Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch, HDFS-3373.trunk.patch.1 As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, and thus never calls DFSClient.close(). 
This means that, until finalizers run, DFSClient will hold on to its SocketCache object and potentially have a lot of outstanding sockets/fds held on to. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443526#comment-13443526 ] Todd Lipcon commented on HDFS-1490: --- - I don't like reusing the ipc ping interval for this timeout here. It's from an entirely separate module, and I don't see why one should correlate to the other. Why not introduce a new config which defaults to something like 1 minute? - In the test case, shouldn't you somehow notify the servlet to exit? Currently it waits on itself, but nothing notifies it. TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when the primary crashes during an image transfer, the secondary namenode would hang forever trying to read the image from the HTTP connection. It would be great to set timeouts on the connection so that if something like that happens there is no need to restart the secondary itself. In our case, restarting components is handled by a set of scripts, and since the Secondary process is still running, it would just stay hung until we get an alarm saying that checkpointing isn't happening. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
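Applying a dedicated timeout to the transfer connection, along the lines Todd suggests, might look like this. The constant name and the hypothetical NN address are invented; the 60-second default follows the "something like 1 minute" suggestion, and the real fix would read the value from a new config property.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the suggested fix: a dedicated timeout on the image-transfer
// connection so a crashed primary cannot hang the secondary forever.
class TransferTimeoutSketch {
    static final int DEFAULT_IMAGE_TRANSFER_TIMEOUT_MS = 60_000;

    static HttpURLConnection openWithTimeout(URL imageUrl, int timeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) imageUrl.openConnection();
        conn.setConnectTimeout(timeoutMs); // fail fast if the NN is unreachable
        conn.setReadTimeout(timeoutMs);    // fail if the stream stalls mid-transfer
        return conn;
    }

    // Demo only: openConnection() performs no network I/O until connect().
    static int demoConnectTimeout() {
        try {
            URL url = new URL("http://nn.example.com:50070/getimage");
            return openWithTimeout(url, DEFAULT_IMAGE_TRANSFER_TIMEOUT_MS).getConnectTimeout();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With a read timeout set, a stalled transfer surfaces as a SocketTimeoutException instead of a permanently hung checkpoint thread.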
[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443541#comment-13443541 ] Hadoop QA commented on HDFS-3849: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542806/HDFS-3849.003.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3112//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3112//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3112//console This message is automatically generated. When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. 
This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-3863) QJM: track last committed txid
Todd Lipcon created HDFS-3863: - Summary: QJM: track last committed txid Key: HDFS-3863 URL: https://issues.apache.org/jira/browse/HDFS-3863 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Assignee: Todd Lipcon Per some discussion with [~stepinto] [here|https://issues.apache.org/jira/browse/HDFS-3077?focusedCommentId=13422579page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422579], we should keep track of the last committed txid on each JournalNode. Then during any recovery operation, we can sanity-check that we aren't asked to truncate a log to an earlier transaction. This is also a necessary step if we want to support reading from in-progress segments in the future (since we should only allow reads up to the commit point) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443551#comment-13443551 ] Robert Joseph Evans commented on HDFS-3731: --- Do you have a list of ones you know about? If not I can start pulling on that thread tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3863) QJM: track last committed txid
[ https://issues.apache.org/jira/browse/HDFS-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443556#comment-13443556 ] Todd Lipcon commented on HDFS-3863: --- The design here is pretty simple, given the way our journaling protocol works. In particular, we only have one outstanding batch of transactions at once. We never send a batch of transactions beginning at txid N until the prior batch (up through N-1) has been accepted at a quorum of nodes. Thus, any {{sendEdits()}} call with {{firstTxId}} N implies a {{commit(N-1)}}. So, my plan is as follows: - Introduce a new file inside the journal directory called {{committed-txid}}. This would include a single numeric text line, similar to the {{seen_txid}} that the NameNode maintains. - Since this whole feature is not required for correctness, we don't need to fsync this file on every update. Instead, we can let the operating system write it out to disk whenever it so chooses. If, after a system crash, it reverts to an earlier value, this is OK, since our recovery protocol doesn't depend on it being up-to-date in any way. Put another way, the invariant is that the file contains a value which is a lower bound on the latest committed txn. The data would be updated whenever a {{sendEdits()}} call is made -- the call implicitly commits all edits prior to the current batch. This alone is enough for a good sanity check. If we want to also support reading the committed transactions while in-progress, it's not quite sufficient -- the last batch of transactions will never be readable if the NN stops writing new batches for a protracted period of time. To solve this, we can add a timer thread to the client which periodically (e.g. once or twice a second) sends an RPC to update the committed-txid on all of the nodes. The periodic timer will also have the nice property of causing a NN which has been fenced to abort itself even if no write transactions are taking place. 
QJM: track last committed txid Key: HDFS-3863 URL: https://issues.apache.org/jira/browse/HDFS-3863 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Affects Versions: QuorumJournalManager (HDFS-3077) Reporter: Todd Lipcon Assignee: Todd Lipcon Per some discussion with [~stepinto] [here|https://issues.apache.org/jira/browse/HDFS-3077?focusedCommentId=13422579page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422579], we should keep track of the last committed txid on each JournalNode. Then during any recovery operation, we can sanity-check that we aren't asked to truncate a log to an earlier transaction. This is also a necessary step if we want to support reading from in-progress segments in the future (since we should only allow reads up to the commit point) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
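The invariants in the design above can be modeled as a tiny state machine. This is a hypothetical sketch (`CommittedTxidTracker` is an invented name, not the actual QuorumJournalManager code): the committed txid is only a lower bound and only moves forward, a `sendEdits()` beginning at txid N implies `commit(N-1)`, and recovery refuses to truncate below the commit point.

```java
// Hypothetical sketch of the committed-txid invariant described above;
// not the actual QJM implementation.
public class CommittedTxidTracker {
    // Persisted value is only a lower bound on the latest committed txn,
    // so it need not be fsync'ed on every update; reverting to an older
    // value after a crash is harmless.
    private long committedTxId = 0;

    // sendEdits() beginning at firstTxId implies commit(firstTxId - 1),
    // because the prior batch was already accepted by a quorum.
    public void onSendEdits(long firstTxId) {
        committedTxId = Math.max(committedTxId, firstTxId - 1);
    }

    // The periodic timer RPC can push the commit point forward even when
    // the NN writes no new batches for a protracted period.
    public void onCommitHeartbeat(long txId) {
        committedTxId = Math.max(committedTxId, txId);
    }

    // Recovery sanity check: never truncate the log below the commit point.
    public boolean canTruncateTo(long txId) {
        return txId >= committedTxId;
    }

    public long getCommittedTxId() {
        return committedTxId;
    }

    public static void main(String[] args) {
        CommittedTxidTracker t = new CommittedTxidTracker();
        t.onSendEdits(101);
        System.out.println("committed: " + t.getCommittedTxId());
    }
}
```

Note the use of `Math.max`: a stale or reordered update can never move the bound backwards, which is exactly why the on-disk file doesn't need to be up to date.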
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443577#comment-13443577 ] Colin Patrick McCabe commented on HDFS-3731: bq. Do you have a list of ones you know about? If not I can start pulling on that thread tomorrow. Sorry, I just took a preliminary look, didn't have time to go in depth. The state machine errors are pretty clear in the test. You may need to wait a while for them to appear since surefire does a lot of buffering. 2.0 release upgrade must handle blocks being written from 1.0 - Key: HDFS-3731 URL: https://issues.apache.org/jira/browse/HDFS-3731 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Suresh Srinivas Assignee: Colin Patrick McCabe Priority: Blocker Fix For: 2.2.0-alpha Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by Brahma Reddy. The {{DataNode}} will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases, there were no block pools-- or equivalently, everything was in the same block pool). During the upgrade, we should hardlink the block files from the {{blocksBeingWritten}} directory into the {{rbw}} directory of this block pool. Similarly, on {{-finalize}}, we should delete the {{blocksBeingWritten}} directory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
Aaron T. Myers created HDFS-3864: Summary: NN does not update internal file mtime for OP_CLOSE when reading from the edit log Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. 
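A minimal model of the bug and its eventual fix (all names here are hypothetical, not the actual NameNode data structures): when replaying an OP_CLOSE, the loader must copy the logged mtime/atime onto the in-memory file, otherwise the times silently regress after a restart or failover.

```java
// Toy model of the HDFS-3864 issue; class and field names are invented,
// not the real FSEditLogLoader/INode code.
public class OpCloseReplaySketch {
    public static class InMemoryFile {
        public long mtime;
        public long atime;
    }

    public static class CloseOp {
        public final long mtime;
        public final long atime;
        public CloseOp(long mtime, long atime) {
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    // The fix in spirit: apply the times recorded in the edit-log op to the
    // in-memory file, instead of leaving the stale pre-close values.
    public static void applyClose(InMemoryFile file, CloseOp op) {
        file.mtime = op.mtime;
        file.atime = op.atime;
    }

    public static void main(String[] args) {
        InMemoryFile f = new InMemoryFile();
        f.mtime = 1342137814473L; // stale value, as in the exception above
        applyClose(f, new CloseOp(1342137814599L, 1342137814600L));
        System.out.println("mtime after replay: " + f.mtime);
    }
}
```

The mtimes in the example mirror the ones in the `FSDownload` exception above: MR2 compares the mtime recorded at submission time against the current one, so any regression fails the job.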
[jira] [Commented] (HDFS-3733) Audit logs should include WebHDFS access
[ https://issues.apache.org/jira/browse/HDFS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443584#comment-13443584 ] Andy Isaacson commented on HDFS-3733: - OK, backing up -- I think my addition of CurClient just duplicates functionality already provided by NamenodeWebHdfsMethods#REMOTE_ADDRESS . So I can drop that new ThreadLocal and just teach NameNodeRpcServer to use REMOTE_ADDRESS appropriately. Or am I missing something? bq. getRemoteIp should not just return NamenodeWebHdfsMethods#getRemoteAddress (I assume you are referring to my newly added {{FSNamesystem#getRemoteIp}}.) Agreed, FSNamesystem should support all remote methods: RPC, WebHdfs ... and Hftp? The {{FSNamesystem#getRemoteIp}} should handle them all. The helper {{NameNodeRpcServer#getRemoteIp}} implements the WebHdfs portion of {{FSNamesystem#getRemoteIp}} just as {{Server#getRemoteIp}} implements the RPC portion. Audit logs should include WebHDFS access Key: HDFS-3733 URL: https://issues.apache.org/jira/browse/HDFS-3733 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Affects Versions: 2.0.0-alpha Reporter: Andy Isaacson Assignee: Andy Isaacson Attachments: hdfs-3733.txt Access via WebHdfs does not result in audit log entries. It should. {noformat} % curl http://nn1:50070/webhdfs/v1/user/adi/hello.txt?op=GETFILESTATUS; {"FileStatus":{"accessTime":1343351432395,"blockSize":134217728,"group":"supergroup","length":12,"modificationTime":1342808158399,"owner":"adi","pathSuffix":"","permission":"644","replication":1,"type":"FILE"}} {noformat} and observe that no audit log entry is generated. Interestingly, OPEN requests do not generate audit log entries when the NN generates the redirect, but do generate audit log entries when the second phase against the DN is executed. {noformat} % curl -v 'http://nn1:50070/webhdfs/v1/user/adi/hello.txt?op=OPEN' ... 
HTTP/1.1 307 TEMPORARY_REDIRECT Location: http://dn01:50075/webhdfs/v1/user/adi/hello.txt?op=OPEN&namenoderpcaddress=nn1:8020&offset=0 ... % curl -v 'http://dn01:50075/webhdfs/v1/user/adi/hello.txt?op=OPEN&namenoderpcaddress=nn1:8020' ... HTTP/1.1 200 OK Content-Type: application/octet-stream Content-Length: 12 Server: Jetty(6.1.26.cloudera.1) hello world {noformat} This happens because {{DatanodeWebHdfsMethods#get}} uses {{DFSClient#open}} thereby triggering the existing {{logAuditEvent}} code. 
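The REMOTE_ADDRESS idea discussed above can be modeled with a per-thread holder. This is a toy sketch, not the real `NamenodeWebHdfsMethods` code; all names are invented: the WebHDFS handler records the caller's IP before invoking the namesystem, and the audit logger falls back to the RPC-layer address when none is set.

```java
// Toy model of per-thread remote-address tracking for audit logging;
// names are hypothetical, not Hadoop's actual classes.
public class RemoteAddressSketch {
    private static final ThreadLocal<String> REMOTE_ADDRESS = new ThreadLocal<>();

    // Set by the WebHDFS handler before it calls into the namesystem.
    public static void setWebHdfsAddress(String ip) {
        REMOTE_ADDRESS.set(ip);
    }

    // Cleared in a finally block so pooled server threads don't leak the
    // previous request's address into the next one.
    public static void clear() {
        REMOTE_ADDRESS.remove();
    }

    // What the audit logger would consult: the WebHDFS-supplied address if
    // present on this thread, otherwise the address the RPC layer reported.
    public static String remoteIpForAudit(String rpcAddress) {
        String webHdfs = REMOTE_ADDRESS.get();
        return webHdfs != null ? webHdfs : rpcAddress;
    }

    public static void main(String[] args) {
        setWebHdfsAddress("192.168.1.5");
        try {
            System.out.println(remoteIpForAudit("10.0.0.1"));
        } finally {
            clear();
        }
    }
}
```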
[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3864: - Status: Patch Available (was: Open) NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. 
[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3864: - Attachment: HDFS-3864.patch Here's a patch which addresses the issue. Fortunately, the fix is quite simple - just apply the values that we read in from the edit log. In addition to the automated test provided in the patch, I also tested this manually on an HA cluster and confirmed that MR jobs no longer experience the distributed cache object changed errors which caused this issue to be discovered. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-3865) TestDistCp is @ignored
Colin Patrick McCabe created HDFS-3865: -- Summary: TestDistCp is @ignored Key: HDFS-3865 URL: https://issues.apache.org/jira/browse/HDFS-3865 Project: Hadoop HDFS Issue Type: Test Components: tools Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Priority: Minor We should fix TestDistCp so that it actually runs, rather than being ignored. {code} @Ignore public class TestDistCp { private static final Log LOG = LogFactory.getLog(TestDistCp.class); private static List<Path> pathList = new ArrayList<Path>(); ... {code} 
[jira] [Updated] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3849: - Resolution: Fixed Fix Version/s: 2.2.0-alpha Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've just committed this to trunk and branch-2. Thanks a lot for the contribution, Colin. When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Fix For: 2.2.0-alpha Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3864: - Attachment: HDFS-3864.patch Thanks a lot for the quick review, Todd. Here's an updated patch which lowers the sleep time to 10 milliseconds. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at 
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. 
[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443598#comment-13443598 ] Hudson commented on HDFS-3849: -- Integrated in Hadoop-Hdfs-trunk-Commit #2716 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2716/]) HDFS-3849. When re-loading the FSImage, we should clear the existing genStamp and leases. Contributed by Colin Patrick McCabe. (Revision 1378364) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378364 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystem.java When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Fix For: 2.2.0-alpha Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. 
This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. 
[jira] [Comment Edited] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443586#comment-13443586 ] Aaron T. Myers edited comment on HDFS-3864 at 8/29/12 9:21 AM: --- Here's a patch which addresses the issue. Fortunately, the fix is quite simple - just apply the values that we read in from the edit log. In addition to the automated test provided in the patch, I also tested this manually on an HA cluster and confirmed that MR jobs no longer experience the distributed cache object changed errors which caused this issue to be discovered. was (Author: atm): Here's a patch which addresses the issue. Fortunately, the fix is quite simply - just apply the values that we read in from the edit log. In addition to the automated test provided in the patch, I also tested this manually on an HA cluster and confirmed that MR jobs no longer experience the :distributed cache object changed errors which caused this issue to be discovered. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2264) NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation
[ https://issues.apache.org/jira/browse/HDFS-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443610#comment-13443610 ] Aaron T. Myers commented on HDFS-2264: -- Hey Jitendra, sorry for forgetting about this JIRA for so long (almost exactly a year!) I just encountered this issue again in a user's cluster. My new thinking is that we should just remove the expected client principal from the NamenodeProtocol entirely. I think this makes sense since the 2NN, SBN, BN, and balancer all potentially use this interface, so there's no single client principal that could reasonably be expected. The balancer, in particular, should be able to be run from any node, even one not running a daemon at all. I think to do what I propose here all we have to do is remove the clientPrincipal parameter from the SecurityInfo annotation on the NamenodeProtocol, and make sure that all of the methods exposed by this interface definitely check for super user privileges. I think most of them do, but we should ensure that they all do. How does this sound to you? NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation --- Key: HDFS-2264 URL: https://issues.apache.org/jira/browse/HDFS-2264 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Aaron T. Myers Assignee: Harsh J Fix For: 0.24.0 Attachments: HDFS-2264.r1.diff The {{@KerberosInfo}} annotation specifies the expected server and client principals for a given protocol in order to look up the correct principal name from the config. The {{NamenodeProtocol}} has the wrong value for the client config key. This wasn't noticed because most setups actually use the same *value* for both the NN and 2NN principals ({{hdfs/_HOST@REALM}}), in which case the {{_HOST}} part gets replaced at run-time. This bug therefore only manifests itself on secure setups which explicitly specify the NN and 2NN principals. 
[jira] [Comment Edited] (HDFS-2264) NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation
[ https://issues.apache.org/jira/browse/HDFS-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443610#comment-13443610 ] Aaron T. Myers edited comment on HDFS-2264 at 8/29/12 9:45 AM: --- Hey Jitendra, sorry for forgetting about this JIRA for so long (almost exactly a year!) I just encountered this issue again in a user's cluster. My new thinking is that we should just remove the expected client principal from the NamenodeProtocol entirely. I think this makes sense since the 2NN, SBN, BN, and balancer all potentially use this interface, so there's no single client principal that could reasonably be expected. The balancer, in particular, should be able to be run from any node, even one not running a daemon at all. I think to do what I propose here all we have to do is remove the clientPrincipal parameter from the SecurityInfo annotation on the NamenodeProtocol, and make sure that all of the methods exposed by this interface definitely check for super user privileges. I think most of them do, but we should ensure that they all do. How does this sound to you? was (Author: atm): Hey Jitendra, sorry for forgetting about this JIRA for so long (almost exactly a year!) I just encountered this issue again in a user's cluster. My new thinking is that we should just remove the expected client principal from the NamenodeProtocol entirely. I think this makes sense the 2NN, SBN, BN, and balancer all potentially use this interface, so there's no single client principal that could reasonably be expected. The balancer, in particular, should be able to be run from any node, even one not running a daemon at all. I think to do what I propose here all we have to do is remove the clientPrincipal parameter from the SecurityInfo annotation on the NamenodeProtocol, and make sure that all of the methods exposed by this interface definitely check for super user privileges. I think most of them do, but we should ensure that they all do. 
How does this sound to you? NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation --- Key: HDFS-2264 URL: https://issues.apache.org/jira/browse/HDFS-2264 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Aaron T. Myers Assignee: Harsh J Fix For: 0.24.0 Attachments: HDFS-2264.r1.diff The {{@KerberosInfo}} annotation specifies the expected server and client principals for a given protocol in order to look up the correct principal name from the config. The {{NamenodeProtocol}} has the wrong value for the client config key. This wasn't noticed because most setups actually use the same *value* for both the NN and 2NN principals ({{hdfs/_HOST@REALM}}), in which case the {{_HOST}} part gets replaced at run-time. This bug therefore only manifests itself on secure setups which explicitly specify the NN and 2NN principals. 
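The proposal can be illustrated with a toy annotation. These are invented model classes, not Hadoop's real {{KerberosInfo}} or {{NamenodeProtocol}}: leaving {{clientPrincipal}} empty means any authenticated principal may connect, which pushes authorization down into per-method superuser checks.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Toy model, not Hadoop's real annotation: an empty clientPrincipal means
// "any authenticated client", so each exposed method must enforce superuser
// privilege itself.
public class KerberosInfoSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface KerberosInfo {
        String serverPrincipal();
        String clientPrincipal() default "";
    }

    // clientPrincipal omitted: 2NN, SBN, BN, and the balancer can all call
    // in, regardless of which principal each runs as.
    @KerberosInfo(serverPrincipal = "dfs.namenode.kerberos.principal")
    public interface NamenodeProtocolModel {
        void rollEditLog();
    }

    public static String clientPrincipalOf(Class<?> protocol) {
        return protocol.getAnnotation(KerberosInfo.class).clientPrincipal();
    }

    public static void main(String[] args) {
        System.out.println("clientPrincipal: \""
            + clientPrincipalOf(NamenodeProtocolModel.class) + "\"");
    }
}
```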
[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.
[ https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443624#comment-13443624 ] Hudson commented on HDFS-3849: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2682 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2682/]) HDFS-3849. When re-loading the FSImage, we should clear the existing genStamp and leases. Contributed by Colin Patrick McCabe. (Revision 1378364) Result = FAILURE atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378364 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystem.java When re-loading the FSImage, we should clear the existing genStamp and leases. -- Key: HDFS-3849 URL: https://issues.apache.org/jira/browse/HDFS-3849 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Critical Fix For: 2.2.0-alpha Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, HDFS-3849.003.patch When re-loading the FSImage, we should clear the existing genStamp and leases. 
This is an issue in the 2NN, because it sometimes clears the existing FSImage and reloads a new one in order to get back in sync with the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
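The failure mode described above can be illustrated with a minimal sketch. All class and field names here are hypothetical, not the actual FSImage/FSNamesystem code: the point is only that an image reload must first reset in-memory state (the generation stamp and the lease map), or entries from the previous image survive the reload.

```java
import java.util.HashMap;
import java.util.Map;

public class ReloadSketch {
    long genStamp = 0;
    final Map<String, String> leases = new HashMap<>();

    // The fix in spirit: clear leftover state before applying the new image.
    void loadImage(long imageGenStamp) {
        leases.clear();          // stale leases must not survive the reload
        genStamp = imageGenStamp; // take the generation stamp from the image
    }

    public static void main(String[] args) {
        ReloadSketch ns = new ReloadSketch();
        // State accumulated from the previously loaded image:
        ns.genStamp = 5000;
        ns.leases.put("/staging/file1", "client-A");
        // The 2NN re-loads a fresh image to get back in sync with the NN:
        ns.loadImage(2000);
        System.out.println("genStamp=" + ns.genStamp
            + " leases=" + ns.leases.size()); // prints: genStamp=2000 leases=0
    }
}
```

Without the clear() step, the lease for /staging/file1 would persist even though the new image knows nothing about it.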
[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HDFS-3466: Attachment: hdfs-3466-b1-2.patch Here's a patch that incorporates Eli's feedback. The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
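The intended lookup can be sketched as below. This is a hedged illustration, not the patch itself: a plain Properties stands in for Hadoop's Configuration, the helper method is hypothetical, and falling back to the service keytab when the web key is unset is an assumption here, not necessarily what the patch does.

```java
import java.util.Properties;

public class SpnegoKeytabSketch {
    // Key names taken from the issue description.
    static final String DFS_NAMENODE_KEYTAB_FILE_KEY =
        "dfs.namenode.keytab.file";
    static final String DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY =
        "dfs.web.authentication.kerberos.keytab";

    // The SPNEGO filter should read the web keytab key, not the NameNode
    // service keytab key. Fallback behavior is an assumption.
    static String spnegoKeytab(Properties conf) {
        String web =
            conf.getProperty(DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY);
        return web != null ? web
                           : conf.getProperty(DFS_NAMENODE_KEYTAB_FILE_KEY);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DFS_NAMENODE_KEYTAB_FILE_KEY,
            "/etc/security/nn.keytab");
        conf.setProperty(DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY,
            "/etc/security/spnego.keytab");
        // prints: /etc/security/spnego.keytab
        System.out.println(spnegoKeytab(conf));
    }
}
```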
[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HDFS-3466: Attachment: hdfs-3466-trunk.patch The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HDFS-3466: Attachment: (was: hdfs-3466-trunk.patch) The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HDFS-3466: Attachment: hdfs-3466-trunk-2.patch The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443645#comment-13443645 ] Hadoop QA commented on HDFS-3466: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542858/hdfs-3466-trunk-2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javac. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3115//console This message is automatically generated. The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443654#comment-13443654 ] Hadoop QA commented on HDFS-3466: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542858/hdfs-3466-trunk-2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javac. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3116//console This message is automatically generated. The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443657#comment-13443657 ] Hadoop QA commented on HDFS-3864: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542840/HDFS-3864.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3113//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3113//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3113//console This message is automatically generated. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. 
Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
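The replay fix described in this issue can be sketched as follows. The classes here are hypothetical stand-ins, not FSEditLogLoader itself: the point is that when an OP_CLOSE is read back from the edit log, the logged mtime/atime must be applied to the in-memory file, so times do not go backward after a restart or HA failover.

```java
import java.util.HashMap;
import java.util.Map;

public class OpCloseReplaySketch {
    static class INodeFile { long mtime, atime; }

    static class CloseOp {
        final String path; final long mtime, atime;
        CloseOp(String path, long mtime, long atime) {
            this.path = path; this.mtime = mtime; this.atime = atime;
        }
    }

    // The missing step before the fix: apply the op's times on replay.
    static void replayClose(Map<String, INodeFile> fs, CloseOp op) {
        INodeFile f = fs.get(op.path);
        f.mtime = op.mtime;
        f.atime = op.atime;
    }

    public static void main(String[] args) {
        Map<String, INodeFile> fs = new HashMap<>();
        INodeFile f = new INodeFile();
        f.mtime = 1342137814473L; // stale time, as in the quoted MR2 failure
        fs.put("/libjars/snappy-java-1.0.3.2.jar", f);
        replayClose(fs, new CloseOp("/libjars/snappy-java-1.0.3.2.jar",
            1342137814599L, 1342137814599L));
        System.out.println(f.mtime); // prints: 1342137814599
    }
}
```

Without replayClose applying the times, the distributed-cache check in MR2 would see 1342137814473 where it expected 1342137814599, exactly the mismatch in the stack trace above.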
[jira] [Updated] (HDFS-3855) Replace hardcoded strings with the already defined config keys in DataNode.java
[ https://issues.apache.org/jira/browse/HDFS-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-3855: - Description: Replace hardcoded strings with the already defined config keys in DataNode.java Replace hardcoded strings with the already defined config keys in DataNode.java Key: HDFS-3855 URL: https://issues.apache.org/jira/browse/HDFS-3855 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Affects Versions: 1.2.0 Reporter: Brandon Li Assignee: Brandon Li Priority: Trivial Attachments: HDFS-3855.branch-1.patch Replace hardcoded strings with the already defined config keys in DataNode.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
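The motivation for this kind of cleanup can be shown in a few lines. This is an illustrative sketch with a hypothetical key and a plain Properties standing in for Hadoop's Configuration: referencing a shared constant instead of retyping the literal string at each call site turns a typo into a compile error rather than a silently ignored setting.

```java
import java.util.Properties;

public class ConfigKeySketch {
    // Defined once; every call site references the constant.
    static final String DFS_DATANODE_DATA_DIR_KEY = "dfs.data.dir";

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DFS_DATANODE_DATA_DIR_KEY, "/data/1,/data/2");

        // Before: conf.getProperty("dfs.data.dir") at each site -- a typo
        // like "dfs.dat.dir" compiles fine and just returns null at runtime.
        // After: the compiler (and IDE rename/find-usages) checks the key.
        String dirs = conf.getProperty(DFS_DATANODE_DATA_DIR_KEY);
        System.out.println(dirs); // prints: /data/1,/data/2
    }
}
```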
[jira] [Commented] (HDFS-3135) Build a war file for HttpFS instead of packaging the server (tomcat) along with the application.
[ https://issues.apache.org/jira/browse/HDFS-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443710#comment-13443710 ] Ryan Hennig commented on HDFS-3135: --- I'm troubleshooting a broken build that fails on the Tomcat download, because our Jenkins server doesn't have internet access (by design). Rather, all components are supposed to be fetched from our internal Maven Repository (Artifactory). So while I don't need the war file change, I do think this direct download should be removed. Build a war file for HttpFS instead of packaging the server (tomcat) along with the application. Key: HDFS-3135 URL: https://issues.apache.org/jira/browse/HDFS-3135 Project: Hadoop HDFS Issue Type: Improvement Components: build Affects Versions: 0.23.2 Reporter: Ravi Prakash Labels: build There are several reasons why web applications should not be packaged along with the server that is expected to serve them. For one, not all organisations use vanilla Tomcat. There are other reasons I won't go into. I'm filing this bug because some of our builds failed in trying to download the tomcat.tar.gz file. We then had to manually wget the file and place it in downloads/ to make the build pass. I suspect the download failed because of an overloaded server (Frankly, I don't really know). If someone has ideas, please share them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443712#comment-13443712 ] Hadoop QA commented on HDFS-3864: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542846/HDFS-3864.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken org.apache.hadoop.hdfs.web.TestWebHDFS org.apache.hadoop.hdfs.server.datanode.TestBPOfferService +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3114//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3114//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3114//console This message is automatically generated. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. 
However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443715#comment-13443715 ] Aaron T. Myers commented on HDFS-3864: -- The findbugs warning is unrelated and I'm confident that the test failures are unrelated as well. I'm going to commit this patch momentarily. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated HDFS-3864: - Resolution: Fixed Fix Version/s: 2.2.0-alpha Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've just committed this to trunk and branch-2. Thanks a lot for the review, Todd. NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Fix For: 2.2.0-alpha Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443737#comment-13443737 ] Hudson commented on HDFS-3864: -- Integrated in Hadoop-Hdfs-trunk-Commit #2717 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2717/]) HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading from the edit log. Contributed by Aaron T. Myers. (Revision 1378413) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378413 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Fix For: 2.2.0-alpha Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
Most of the time this will be harmless and folks won't notice, but in the event one of these files is being used in the distributed cache of an MR job when an HA failover occurs, the job might notice that the mtime of a cache file has changed, which in MR2 will cause the job to fail with an exception like the following: {noformat} java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} Credit to Sujay Rau for discovering this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443738#comment-13443738 ] Hudson commented on HDFS-3864: -- Integrated in Hadoop-Common-trunk-Commit #2654 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2654/]) HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading from the edit log. Contributed by Aaron T. Myers. (Revision 1378413) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1378413 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Fix For: 2.2.0-alpha Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
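The essence of the fix is that replaying an OP_CLOSE must copy the logged mtime/atime onto the in-memory inode, not just close the file. Below is a self-contained toy model of that behavior in plain Java; the class and field names (`FileMeta`, `CloseOp`, `namespace`) are hypothetical stand-ins for Hadoop's internal structures, not the actual FSEditLogLoader code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the HDFS-3864 fix: replaying a close op must apply the
// logged mtime/atime to the in-memory file, or timestamps appear to
// "go back in time" after an NN restart or HA failover.
public class CloseOpReplay {
    // Hypothetical stand-in for an in-memory inode.
    static class FileMeta {
        long mtime;
        long atime;
    }

    // Hypothetical stand-in for a logged OP_CLOSE record.
    static class CloseOp {
        final String path;
        final long mtime;
        final long atime;
        CloseOp(String path, long mtime, long atime) {
            this.path = path;
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    // Hypothetical stand-in for the namesystem's path -> inode map.
    static final Map<String, FileMeta> namespace = new HashMap<>();

    // Corrected replay: apply the op's timestamps to the inode.
    static void replayClose(CloseOp op) {
        FileMeta meta = namespace.computeIfAbsent(op.path, p -> new FileMeta());
        meta.mtime = op.mtime;  // without these two lines, the bug reproduces
        meta.atime = op.atime;
    }

    public static void main(String[] args) {
        replayClose(new CloseOp("/f", 1342137814599L, 1342137814599L));
        System.out.println(namespace.get("/f").mtime); // prints 1342137814599
    }
}
```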
[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log
[ https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443752#comment-13443752 ] Hudson commented on HDFS-3864: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2683 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2683/]) HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading from the edit log. Contributed by Aaron T. Myers. (Revision 1378413) Result = FAILURE atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378413 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java NN does not update internal file mtime for OP_CLOSE when reading from the edit log -- Key: HDFS-3864 URL: https://issues.apache.org/jira/browse/HDFS-3864 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.0-alpha Reporter: Aaron T. Myers Assignee: Aaron T. Myers Fix For: 2.2.0-alpha Attachments: HDFS-3864.patch, HDFS-3864.patch When logging an OP_CLOSE to the edit log, the NN writes out an updated file mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN does not apply these values to the in-memory FS data structure. Because of this, a file's mtime or atime may appear to go back in time after an NN restart, or an HA failover. 
[jira] [Commented] (HDFS-1490) TransferFSImage should timeout
[ https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443791#comment-13443791 ] Vinay commented on HDFS-1490: - {quote}Why not introduce a new config which defaults to something like 1 minute?{quote} Ok, agreed. I will introduce a new config for this. {quote}In the test case, shouldn't you somehow notify the servlet to exit? Currently it waits on itself, but nothing notifies it.{quote} That wait was only added to make the client call time out. Ideally it will be interrupted when the server is stopped, but I will add a timeout for it as well. Thanks for the comments, Todd. I will post a new patch shortly. TransferFSImage should timeout -- Key: HDFS-1490 URL: https://issues.apache.org/jira/browse/HDFS-1490 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Dmytro Molkov Assignee: Dmytro Molkov Priority: Minor Attachments: HDFS-1490.patch, HDFS-1490.patch Sometimes when the primary crashes during image transfer, the secondary namenode can hang forever trying to read the image from the HTTP connection. It would be great to set timeouts on the connection so that if something like that happens there is no need to restart the secondary itself. In our case restarting components is handled by a set of scripts, and since the Secondary process is still running it would just stay hung until we get an alarm saying that checkpointing isn't happening. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
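The change under discussion amounts to reading a timeout from a new config key and applying it to the image-transfer HTTP connection so the secondary fails fast instead of hanging. A minimal sketch of that idea, with the caveat that the config key name, the one-minute default, and the `Map`-based "conf" below are illustrative assumptions, not the actual patch:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

// Sketch of the HDFS-1490 idea: bound how long the secondary waits on the
// image-transfer connection instead of blocking forever on a dead primary.
public class ImageTransferTimeout {
    // Hypothetical key and default; the real patch may name them differently.
    public static final String TIMEOUT_KEY = "dfs.image.transfer.timeout";
    public static final int DEFAULT_TIMEOUT_MS = 60 * 1000; // 1 minute

    // Resolve the timeout from a simple key/value "conf" lookup.
    public static int resolveTimeout(Map<String, String> conf) {
        String v = conf.get(TIMEOUT_KEY);
        return v == null ? DEFAULT_TIMEOUT_MS : Integer.parseInt(v);
    }

    // Apply the timeout so a crashed peer cannot hang the reader forever.
    public static HttpURLConnection open(URL url, int timeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(timeoutMs); // fail fast if the peer is gone
        conn.setReadTimeout(timeoutMs);    // fail if the stream stalls mid-transfer
        return conn;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new java.util.HashMap<>();
        System.out.println(resolveTimeout(conf)); // prints 60000 (the default)
        conf.put(TIMEOUT_KEY, "30000");
        System.out.println(resolveTimeout(conf)); // prints 30000
    }
}
```

With a read timeout set, a stalled transfer surfaces as a `SocketTimeoutException` that the checkpointing code can handle, rather than a thread blocked indefinitely in a read.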
[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file
[ https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443810#comment-13443810 ] Eli Collins commented on HDFS-3466: --- Hey Owen, I think you meant to remove the 2nd initialization of httpKeytab.
{code}
+    String httpKeytab = conf.get(
+        DFSConfigKeys.DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY);
+    if (httpKeytab == null) {
+      httpKeytab = conf.get(DFSConfigKeys.DFS_NAMENODE_KEYTAB_FILE_KEY);
+    }
     String httpKeytab = conf
         .get(DFSConfigKeys.DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY);
{code}
The SPNEGO filter for the NameNode should come out of the web keytab file - Key: HDFS-3466 URL: https://issues.apache.org/jira/browse/HDFS-3466 Project: Hadoop HDFS Issue Type: Bug Components: name-node, security Affects Versions: 1.1.0, 2.0.0-alpha Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, hdfs-3466-trunk-2.patch Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
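Once the stray second declaration is dropped, the intended logic is a plain two-key fallback: prefer the SPNEGO (web) keytab, fall back to the NameNode keytab when it is unset. A self-contained sketch of that lookup, using an ordinary `Map` in place of Hadoop's `Configuration`; the key strings match what I believe the `DFSConfigKeys` constants expand to, but treat them as assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fallback described in the comment: prefer the SPNEGO (web)
// keytab key, fall back to the NameNode keytab only when the web key is unset.
public class KeytabFallback {
    // Assumed values of DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY and
    // DFS_NAMENODE_KEYTAB_FILE_KEY respectively.
    static final String WEB_KEYTAB_KEY = "dfs.web.authentication.kerberos.keytab";
    static final String NN_KEYTAB_KEY = "dfs.namenode.keytab.file";

    static String resolveHttpKeytab(Map<String, String> conf) {
        String httpKeytab = conf.get(WEB_KEYTAB_KEY);
        if (httpKeytab == null) {
            httpKeytab = conf.get(NN_KEYTAB_KEY);
        }
        return httpKeytab; // no second declaration shadowing this value
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(NN_KEYTAB_KEY, "/etc/security/nn.keytab");
        System.out.println(resolveHttpKeytab(conf)); // falls back to the NN keytab
        conf.put(WEB_KEYTAB_KEY, "/etc/security/spnego.keytab");
        System.out.println(resolveHttpKeytab(conf)); // the web keytab wins
    }
}
```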
[jira] [Commented] (HDFS-3865) TestDistCp is @ignored
[ https://issues.apache.org/jira/browse/HDFS-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443812#comment-13443812 ] Eli Collins commented on HDFS-3865: --- Looks like some of the tests are commented out as well (e.g. testUniformSizeDistCp). TestDistCp is @ignored -- Key: HDFS-3865 URL: https://issues.apache.org/jira/browse/HDFS-3865 Project: Hadoop HDFS Issue Type: Test Components: tools Affects Versions: 2.2.0-alpha Reporter: Colin Patrick McCabe Priority: Minor We should fix TestDistCp so that it actually runs, rather than being ignored.
{code}
@Ignore
public class TestDistCp {
  private static final Log LOG = LogFactory.getLog(TestDistCp.class);
  private static List<Path> pathList = new ArrayList<Path>();
  ...
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HDFS-282) Serialize ipcPort in DatanodeID instead of DatanodeRegistration and DatanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins resolved HDFS-282. -- Resolution: Not A Problem No longer an issue now that the writable methods have been removed. Serialize ipcPort in DatanodeID instead of DatanodeRegistration and DatanodeInfo Key: HDFS-282 URL: https://issues.apache.org/jira/browse/HDFS-282 Project: Hadoop HDFS Issue Type: Improvement Reporter: Tsz Wo (Nicholas), SZE The field DatanodeID.ipcPort is currently serialized in DatanodeRegistration and DatanodeInfo. Once HADOOP-2797 (remove the code for handling the old layout) is committed, DatanodeID.ipcPort should be serialized in DatanodeID. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443827#comment-13443827 ] Eli Collins commented on HDFS-3837: --- I investigated some more and confirmed findbugs isn't searching back far enough for the common superclass. E.g. if I swap the variables in the equals call I get:
{noformat}
org.apache.hadoop.hdfs.protocol.DatanodeInfo.equals(Object) used to determine equality
org.apache.hadoop.hdfs.server.common.JspHelper$NodeRecord.equals(Object) used to determine equality
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.equals(Object) used to determine equality
At DataNode.java:[line 1871]
{noformat}
It stops at DatanodeDescriptor#equals even though this calls super.equals (DatanodeInfo), which in turn calls super.equals (DatanodeID), just like the current warning stops at DatanodeRegistration#equals, which calls super.equals (DatanodeID). It would be better (and findbugs wouldn't choke) if the various classes that extend DatanodeID had a member instead, but I looked at this for HDFS-3237 and it required a ton of changes that probably aren't worth it. Given this, I'll update the patch per your suggestion, Suresh, to ignore the warning in DataNode#recoverBlock. Fix DataNode.recoverBlock findbugs warning -- Key: HDFS-3837 URL: https://issues.apache.org/jira/browse/HDFS-3837 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Eli Collins Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt HDFS-2686 introduced the following findbugs warning:
{noformat}
Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
{noformat}
Both sides are using DatanodeID#equals, but findbugs sees a different method because DNR#equals overrides equals for some reason (without changing behavior). -- This message is automatically generated by JIRA. 
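Ignoring the warning is typically done with an entry in the project's FindBugs exclude file (Hadoop keeps one under dev-support/). A sketch of what such an entry might look like; the bug pattern here is my best guess for "Call to equals() comparing different types" and should be verified against the actual report before committing:

```xml
<!-- Illustrative FindBugs exclusion for the DataNode.recoverBlock warning.
     FindBugs cannot see that both sides ultimately use DatanodeID#equals. -->
<FindBugsFilter>
  <Match>
    <Class name="org.apache.hadoop.hdfs.server.datanode.DataNode" />
    <Method name="recoverBlock" />
    <Bug pattern="EC_UNRELATED_TYPES" />
  </Match>
</FindBugsFilter>
```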
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning
[ https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-3837: -- Attachment: hdfs-3837.txt Updated patch attached. Fix DataNode.recoverBlock findbugs warning -- Key: HDFS-3837 URL: https://issues.apache.org/jira/browse/HDFS-3837 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Eli Collins Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt HDFS-2686 introduced the following findbugs warning: {noformat} Call to equals() comparing different types in org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock) {noformat} Both are using DatanodeID#equals but it's a different method because DNR#equals overrides equals for some reason (doesn't change behavior). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira