[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3860: Attachment: HDFS-heartbeat-testcase.patch HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the monitor thread will acquire the write lock of the namesystem and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without releasing the write lock. This may cause the monitor thread to wrongly hold the write lock forever. The attached test case tries to simulate this bad scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
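The locking bug described above can be sketched in isolation. Below is a minimal sketch assuming a simplified namesystem guarded by a ReentrantReadWriteLock; the names (NamesystemSketch, heartbeatCheckBuggy, heartbeatCheckFixed, inSafeMode) are hypothetical and not the actual HDFS code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the lock-leak pattern described in HDFS-3860.
public class NamesystemSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    boolean inSafeMode;
    boolean deadNodeFound = true;

    // Buggy shape: the early return inside the locked region skips unlock().
    public void heartbeatCheckBuggy() {
        if (deadNodeFound) {
            lock.writeLock().lock();
            if (inSafeMode) {
                return; // write lock is never released
            }
            // ... remove dead datanode here ...
            lock.writeLock().unlock();
        }
    }

    // Fixed shape: try/finally guarantees the lock is released on every path.
    public void heartbeatCheckFixed() {
        if (deadNodeFound) {
            lock.writeLock().lock();
            try {
                if (inSafeMode) {
                    return; // safe: the finally block still runs
                }
                // ... remove dead datanode here ...
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    public boolean writeLocked() {
        return lock.isWriteLockedByCurrentThread();
    }
}
```

The fix is the standard try/finally idiom: every early return inside the locked region still executes the finally block, so the write lock cannot leak.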
[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3860: Attachment: HDFS-3860.patch HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the monitor thread will acquire the write lock of the namesystem and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without releasing the write lock. This may cause the monitor thread to wrongly hold the write lock forever. The attached test case tries to simulate this bad scenario.
[jira] [Updated] (HDFS-3703) Decrease the datanode failure detection time
[ https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3703: Attachment: HDFS-3703.patch This patch handles stale nodes for reading in a straightforward way. It adds two configuration parameters: one to indicate whether to detect stale nodes, and one for the time interval after which a node is treated as stale. The DatanodeManager#sortLocatedBlocks method then checks whether datanodes are stale and moves possibly stale nodes to the end of the list. Decrease the datanode failure detection time Key: HDFS-3703 URL: https://issues.apache.org/jira/browse/HDFS-3703 Project: Hadoop HDFS Issue Type: Improvement Components: data-node, name-node Affects Versions: 1.0.3, 2.0.0-alpha Reporter: nkeywal Assignee: Suresh Srinivas Attachments: HDFS-3703.patch By default, if a box dies, the datanode will be marked as dead by the namenode after 10 minutes 30 seconds. In the meantime, this datanode will still be proposed by the namenode to write blocks or to read replicas. It happens as well if the datanode crashes: there is no shutdown hook to tell the namenode we're not there anymore. It is especially an issue with HBase. The HBase regionserver timeout for production is often 30s. So with these configs, when a box dies, HBase starts to recover after 30s while, for 10 minutes, the namenode still considers the blocks on the same box as available. Beyond the write errors, this will trigger a lot of missed reads: - during the recovery, HBase needs to read the blocks used on the dead box (the ones in the 'HBase Write-Ahead-Log') - after the recovery, reading these data blocks (the 'HBase region') will fail 33% of the time with the default number of replicas, slowing the data access, especially when the errors are socket timeouts (i.e. around 60s most of the time). Globally, it would be ideal if the HDFS detection settings could be set under the HBase ones. As a side note, HBase relies on ZooKeeper to detect regionserver issues.
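The sorting idea in the HDFS-3703 patch above can be sketched roughly as follows, assuming a "stale" node is one whose last heartbeat is older than a configured interval. The names (StaleNodeSort, sortLocated, Node) are illustrative, not the actual DatanodeManager code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: demote replicas on stale datanodes to the end of the
// located-block list so clients try fresh nodes first.
public class StaleNodeSort {
    static class Node {
        final String name;
        final long lastHeartbeatMillis;
        Node(String name, long lastHeartbeatMillis) {
            this.name = name;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
        boolean isStale(long now, long staleIntervalMillis) {
            return now - lastHeartbeatMillis > staleIntervalMillis;
        }
    }

    // Stable sort: stale nodes sink to the end, fresh nodes keep their order.
    static List<Node> sortLocated(List<Node> nodes, long now, long staleInterval) {
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparing((Node n) -> n.isStale(now, staleInterval)));
        return sorted;
    }
}
```

Using a stable sort keyed only on staleness preserves the namenode's original ordering among fresh nodes, so only possibly-stale replicas are demoted.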
[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
[ https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443292#comment-13443292 ] Jing Zhao commented on HDFS-3860: - I just checked all the invocations of namesystem#writelock / writeunlock, and did not find similar problems. I will check other similar code too. HeartbeatManager#Monitor may wrongly hold the writelock of namesystem - Key: HDFS-3860 URL: https://issues.apache.org/jira/browse/HDFS-3860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Fix For: 2.2.0-alpha Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the monitor thread will acquire the write lock of the namesystem and recheck the safemode. If it is in safemode, the monitor thread will return from the heartbeatCheck function without releasing the write lock. This may cause the monitor thread to wrongly hold the write lock forever. The attached test case tries to simulate this bad scenario.
[jira] [Created] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java
Jing Zhao created HDFS-3887: --- Summary: Remove redundant chooseTarget methods in BlockPlacementPolicy.java Key: HDFS-3887 URL: https://issues.apache.org/jira/browse/HDFS-3887 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Trivial BlockPlacementPolicy.java contains multiple chooseTarget() methods with different parameter lists. It is difficult to follow and understand the code since some chooseTarget methods only have minor differences and some of them are only invoked by testing code. In this patch, I try to remove some of the chooseTarget methods and only keep three of them: two abstract methods and the third one using BlockCollection as its parameter.
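The consolidation described above follows a common shape: keep one core method and let thin convenience overloads delegate to it with defaults filled in, instead of maintaining parallel copies. A hypothetical sketch (PlacementSketch and its parameters are made up for illustration, not the real BlockPlacementPolicy API):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of overload consolidation.
public class PlacementSketch {
    // The one "real" implementation every overload funnels into.
    static List<String> chooseTarget(int numReplicas, String writer,
                                     List<String> excluded) {
        // placeholder body: a real policy would pick datanodes here
        return Arrays.asList(writer + ":replicas=" + numReplicas
                + ":excluded=" + excluded.size());
    }

    // Thin convenience overload with a default, instead of a parallel copy.
    static List<String> chooseTarget(int numReplicas, String writer) {
        return chooseTarget(numReplicas, writer, Collections.emptyList());
    }
}
```

With this shape, behavior changes land in one place and the remaining overloads cannot drift apart, which is the readability argument the issue makes.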
[jira] [Updated] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java
[ https://issues.apache.org/jira/browse/HDFS-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3887: Attachment: HDFS-3887.patch Remove redundant chooseTarget methods in BlockPlacementPolicy.java -- Key: HDFS-3887 URL: https://issues.apache.org/jira/browse/HDFS-3887 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Trivial Attachments: HDFS-3887.patch BlockPlacementPolicy.java contains multiple chooseTarget() methods with different parameter lists. It is difficult to follow and understand the code since some chooseTarget methods only have minor differences and some of them are only invoked by testing code. In this patch, I try to remove some of the chooseTarget methods and only keep three of them: two abstract methods and the third one using BlockCollection as its parameter.
[jira] [Commented] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java
[ https://issues.apache.org/jira/browse/HDFS-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13447521#comment-13447521 ] Jing Zhao commented on HDFS-3887: - I have rerun the two failed tests on my local machine and both passed. Remove redundant chooseTarget methods in BlockPlacementPolicy.java -- Key: HDFS-3887 URL: https://issues.apache.org/jira/browse/HDFS-3887 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Trivial Attachments: HDFS-3887.patch BlockPlacementPolicy.java contains multiple chooseTarget() methods with different parameter lists. It is difficult to follow and understand the code since some chooseTarget methods only have minor differences and some of them are only invoked by testing code. In this patch, I try to remove some of the chooseTarget methods and only keep three of them: two abstract methods and the third one using BlockCollection as its parameter.
[jira] [Created] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed
Jing Zhao created HDFS-3888: --- Summary: BlockPlacementPolicyDefault#LOG should be removed Key: HDFS-3888 URL: https://issues.apache.org/jira/browse/HDFS-3888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget method, the logic computing the maxTargetPerLoc can be made a separate method.
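The LOG problem above is an instance of Java field hiding: a static field redeclared in a subclass shadows the base class's field of the same name, so the same object resolves LOG differently depending on which class's code references it. A minimal sketch with plain Strings standing in for Log objects (all names illustrative):

```java
// Hypothetical sketch of the field-hiding pitfall described in HDFS-3888.
public class FieldHiding {
    static class BasePolicy {
        static final String LOG = "BasePolicy.LOG";
        // Code compiled in BasePolicy binds LOG to BasePolicy.LOG.
        String logViaBase() { return LOG; }
    }

    static class DefaultPolicy extends BasePolicy {
        // Redeclaring LOG hides BasePolicy.LOG within this class's code.
        static final String LOG = "DefaultPolicy.LOG";
        String logViaSubclass() { return LOG; }
    }
}
```

Because static fields are resolved at compile time per class, methods inherited from the base class keep using the base LOG while subclass methods use the shadowing one; removing the redeclaration eliminates the ambiguity.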
[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed
[ https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3888: Attachment: HDFS-3888.patch BlockPlacementPolicyDefault#LOG should be removed - Key: HDFS-3888 URL: https://issues.apache.org/jira/browse/HDFS-3888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-3888.patch BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget method, the logic computing the maxTargetPerLoc can be made a separate method.
[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed
[ https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3888: Status: Patch Available (was: Open) BlockPlacementPolicyDefault#LOG should be removed - Key: HDFS-3888 URL: https://issues.apache.org/jira/browse/HDFS-3888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-3888.patch BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget method, the logic computing the maxTargetPerLoc can be made a separate method.
[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault code cleanup
[ https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3888: Status: Open (was: Patch Available) BlockPlacementPolicyDefault code cleanup Key: HDFS-3888 URL: https://issues.apache.org/jira/browse/HDFS-3888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-3888.patch BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget method, the logic computing the maxTargetPerLoc can be made a separate method.
[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault code cleanup
[ https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-3888: Attachment: HDFS-3888.patch The code for computing the maxNodePerRack in the chooseTarget() method is put back because it may also change the value of numOfReplicas. BlockPlacementPolicyDefault code cleanup Key: HDFS-3888 URL: https://issues.apache.org/jira/browse/HDFS-3888 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-3888.patch, HDFS-3888.patch BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget method, the logic computing the maxTargetPerLoc can be made a separate method.
[jira] [Assigned] (HDFS-2656) Implement a pure c client based on webhdfs
[ https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao reassigned HDFS-2656: --- Assignee: Jing Zhao Implement a pure c client based on webhdfs -- Key: HDFS-2656 URL: https://issues.apache.org/jira/browse/HDFS-2656 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Reporter: Zhanwei.Wang Assignee: Jing Zhao Attachments: HDFS-2656.patch, HDFS-2656.patch, HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png Currently, the implementation of libhdfs is based on JNI. The overhead of the JVM seems a little big, and libhdfs also cannot be used in environments without HDFS. It seems a good idea to implement a pure C client by wrapping webhdfs. It could also be used to access different versions of HDFS.
[jira] [Updated] (HDFS-2656) Implement a pure c client based on webhdfs
[ https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-2656: Affects Version/s: 3.0.0 Status: Patch Available (was: Open) Implement a pure c client based on webhdfs -- Key: HDFS-2656 URL: https://issues.apache.org/jira/browse/HDFS-2656 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 3.0.0 Reporter: Zhanwei.Wang Assignee: Jing Zhao Attachments: HDFS-2656.patch, HDFS-2656.patch, HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png Currently, the implementation of libhdfs is based on JNI. The overhead of the JVM seems a little big, and libhdfs also cannot be used in environments without HDFS. It seems a good idea to implement a pure C client by wrapping webhdfs. It could also be used to access different versions of HDFS.
[jira] [Updated] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it
[ https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6041: Attachment: HDFS-6041.000.patch Downgrade/Finalize should rename the rollback image instead of purging it - Key: HDFS-6041 URL: https://issues.apache.org/jira/browse/HDFS-6041 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6041.000.patch After we perform a rolling upgrade downgrade/finalize, instead of purging the rollback image, we'd better rename it back to a normal image, since the rollback image can be the most recent checkpoint. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.001.patch Update bkjournal to fix compilation errors. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch In an HA setup, the JNs receive edit logs (blobs) from the NN and write them into edit log files. In order to write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade.
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918912#comment-13918912 ] Jing Zhao commented on HDFS-6038: - One issue with the current patch is that the JN will also check the layoutversion locally while serving read requests. Let me see if we can bypass this check in the JN. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch In an HA setup, the JNs receive edit logs (blobs) from the NN and write them into edit log files. In order to write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade.
[jira] [Updated] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it
[ https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6041: Attachment: HDFS-6041.001.patch Thanks for the review, Nicholas! Update the patch to address your comments. Downgrade/Finalize should rename the rollback image instead of purging it - Key: HDFS-6041 URL: https://issues.apache.org/jira/browse/HDFS-6041 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6041.000.patch, HDFS-6041.001.patch After we perform a rolling upgrade downgrade/finalize, instead of purging the rollback image, we'd better rename it back to a normal image, since the rollback image can be the most recent checkpoint.
[jira] [Resolved] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it
[ https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao resolved HDFS-6041. - Resolution: Fixed Thanks Nicholas! I've committed this. Downgrade/Finalize should rename the rollback image instead of purging it - Key: HDFS-6041 URL: https://issues.apache.org/jira/browse/HDFS-6041 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6041.000.patch, HDFS-6041.001.patch After we perform a rolling upgrade downgrade/finalize, instead of purging the rollback image, we'd better rename it back to a normal image, since the rollback image can be the most recent checkpoint.
[jira] [Created] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2
Jing Zhao created HDFS-6053: --- Summary: Fix TestDecommissioningStatus and TestDecommission in branch-2 Key: HDFS-6053 URL: https://issues.apache.org/jira/browse/HDFS-6053 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao The failure is caused by the backport of HDFS-5285. In BlockManager#isReplicationInProgress, if (bc instanceof MutableBlockCollection) should be replaced by if (bc.isUnderConstruction()).
[jira] [Updated] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2
[ https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6053: Attachment: HDFS-6053.000.patch A simple patch to fix the failures. Fix TestDecommissioningStatus and TestDecommission in branch-2 -- Key: HDFS-6053 URL: https://issues.apache.org/jira/browse/HDFS-6053 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6053.000.patch The failure is caused by the backport of HDFS-5285. In BlockManager#isReplicationInProgress, if (bc instanceof MutableBlockCollection) should be replaced by if (bc.isUnderConstruction()).
[jira] [Commented] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2
[ https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919809#comment-13919809 ] Jing Zhao commented on HDFS-6053: - In my local tests the patch fixes the two failed unit tests, and with the change the code is consistent with trunk. Fix TestDecommissioningStatus and TestDecommission in branch-2 -- Key: HDFS-6053 URL: https://issues.apache.org/jira/browse/HDFS-6053 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6053.000.patch The failure is caused by the backport of HDFS-5285. In BlockManager#isReplicationInProgress, if (bc instanceof MutableBlockCollection) should be replaced by if (bc.isUnderConstruction()).
[jira] [Resolved] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2
[ https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao resolved HDFS-6053. - Resolution: Fixed Fix Version/s: 2.4.0 Thanks for the review, Nicholas! I've committed this to branch-2 and branch-2.4.0. Fix TestDecommissioningStatus and TestDecommission in branch-2 -- Key: HDFS-6053 URL: https://issues.apache.org/jira/browse/HDFS-6053 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Fix For: 2.4.0 Attachments: HDFS-6053.000.patch The failure is caused by the backport of HDFS-5285. In BlockManager#isReplicationInProgress, if (bc instanceof MutableBlockCollection) should be replaced by if (bc.isUnderConstruction()).
[jira] [Commented] (HDFS-6043) Give HDFS daemons NFS3 and Portmap their own OPTS
[ https://issues.apache.org/jira/browse/HDFS-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920148#comment-13920148 ] Jing Zhao commented on HDFS-6043: - nit: in export HADOOP_NFS3_OPTS= $HADOOP_NFS3_OPTS there is an extra space before the $. Other than that, +1. Give HDFS daemons NFS3 and Portmap their own OPTS - Key: HDFS-6043 URL: https://issues.apache.org/jira/browse/HDFS-6043 Project: Hadoop HDFS Issue Type: Improvement Components: nfs Reporter: Brandon Li Assignee: Brandon Li Attachments: HDFS-6043.001.patch As for some other HDFS services, dedicated OPTS variables make it easier for users to update resource-related settings for the NFS gateway.
[jira] [Commented] (HDFS-6044) Add property for setting the NFS look up time for users
[ https://issues.apache.org/jira/browse/HDFS-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920163#comment-13920163 ] Jing Zhao commented on HDFS-6044: - The patch looks good to me. Some minor comments: # We may want to define the constant string somewhere in the following code? {code} +timeout = conf.getLong(hadoop.nfs3.userupdate.ms, TIMEOUT_DEFAULT); {code} # Can we declare timeout as final and initialize it in the two constructors? Other than that, +1. Add property for setting the NFS look up time for users --- Key: HDFS-6044 URL: https://issues.apache.org/jira/browse/HDFS-6044 Project: Hadoop HDFS Issue Type: Improvement Components: nfs Reporter: Brandon Li Assignee: Brandon Li Priority: Minor Attachments: HDFS-6044.001.patch, HDFS-6044.002.patch Currently the NFS gateway refreshes the user accounts every 15 minutes. Add a property to make this tunable in different environments.
[jira] [Commented] (HDFS-5167) Add metrics about the NameNode retry cache
[ https://issues.apache.org/jira/browse/HDFS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921142#comment-13921142 ] Jing Zhao commented on HDFS-5167: - +1 for the latest patch. I will commit it soon. Add metrics about the NameNode retry cache -- Key: HDFS-5167 URL: https://issues.apache.org/jira/browse/HDFS-5167 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, namenode Affects Versions: 3.0.0, 2.3.0, 2.4.0 Reporter: Jing Zhao Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: HDFS-5167.1.patch, HDFS-5167.10.patch, HDFS-5167.11.patch, HDFS-5167.12.patch, HDFS-5167.2.patch, HDFS-5167.3.patch, HDFS-5167.4.patch, HDFS-5167.5.patch, HDFS-5167.6.patch, HDFS-5167.6.patch, HDFS-5167.7.patch, HDFS-5167.8.patch, HDFS-5167.9-2.patch, HDFS-5167.9.patch It will be helpful to have metrics in NameNode about the retry cache, such as the retry count etc.
[jira] [Updated] (HDFS-5167) Add metrics about the NameNode retry cache
[ https://issues.apache.org/jira/browse/HDFS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-5167: Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks for the great work, [~ozawa]! I've committed this to trunk, branch-2 and branch-2.4.0. Add metrics about the NameNode retry cache -- Key: HDFS-5167 URL: https://issues.apache.org/jira/browse/HDFS-5167 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, namenode Affects Versions: 3.0.0, 2.3.0, 2.4.0 Reporter: Jing Zhao Assignee: Tsuyoshi OZAWA Priority: Minor Fix For: 2.4.0 Attachments: HDFS-5167.1.patch, HDFS-5167.10.patch, HDFS-5167.11.patch, HDFS-5167.12.patch, HDFS-5167.2.patch, HDFS-5167.3.patch, HDFS-5167.4.patch, HDFS-5167.5.patch, HDFS-5167.6.patch, HDFS-5167.6.patch, HDFS-5167.7.patch, HDFS-5167.8.patch, HDFS-5167.9-2.patch, HDFS-5167.9.patch It will be helpful to have metrics in NameNode about the retry cache, such as the retry count etc.
[jira] [Updated] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change
[ https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6058: Affects Version/s: 2.4.0 Fix TestHDFSCLI failures after HADOOP-8691 change - Key: HDFS-6058 URL: https://issues.apache.org/jira/browse/HDFS-6058 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Vinayakumar B Assignee: Akira AJISAKA Attachments: HDFS-6058.000.patch, HDFS-6058.patch HADOOP-8691 changed the ls command output. TestHDFSCLI needs to be updated after this change. The latest precommit builds are failing because of this. https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/
[jira] [Updated] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change
[ https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6058: Status: Patch Available (was: Open) Fix TestHDFSCLI failures after HADOOP-8691 change - Key: HDFS-6058 URL: https://issues.apache.org/jira/browse/HDFS-6058 Project: Hadoop HDFS Issue Type: Bug Reporter: Vinayakumar B Assignee: Akira AJISAKA Attachments: HDFS-6058.000.patch, HDFS-6058.patch HADOOP-8691 changed the ls command output. TestHDFSCLI needs to be updated after this change. The latest precommit builds are failing because of this. https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/
[jira] [Commented] (HDFS-5653) Log namenode hostname in various exceptions being thrown in a HA setup
[ https://issues.apache.org/jira/browse/HDFS-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921565#comment-13921565 ] Jing Zhao commented on HDFS-5653: - +1, the new patch looks good to me. Log namenode hostname in various exceptions being thrown in a HA setup -- Key: HDFS-5653 URL: https://issues.apache.org/jira/browse/HDFS-5653 Project: Hadoop HDFS Issue Type: Improvement Components: ha Affects Versions: 2.2.0 Reporter: Arpit Gupta Assignee: Haohui Mai Priority: Minor Attachments: HDFS-5653.000.patch, HDFS-5653.001.patch, HDFS-5653.002.patch, HDFS-5653.003.patch, HDFS-5653.004.patch, HDFS-5653.005.patch, HDFS-5653.006.patch In an HA setup, any time we see an exception such as safemode or namenode-in-standby, we don't know which namenode it came from. The user has to go to the logs of the namenodes and determine which one was active and/or standby around the same time. I think it would help with debugging if any such exceptions could include the namenode hostname so the user could know exactly which namenode served the request.
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.002.patch JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch In an HA setup, the JNs receive edit logs (blobs) from the NN and write them into edit log files. In order to write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade.
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921602#comment-13921602 ] Jing Zhao commented on HDFS-6038: - After offline discussion with [~szetszwo], the 002 patch makes the following changes: 1. Persist a length field for each editlog op to indicate the total length of the op. 2. In JournalNode, instead of calling EditLogFileInputStream#validateEditLog to get the last txid of an in-progress editlog segment, we add a new method scanEditLog which does not decode each editlog op. Instead, the new method reads the length and txid of each op, and uses the length to quickly jump to the next op. The 002 patch is just a preliminary patch to demonstrate the idea. We still need to fix unit tests and run system tests. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch In an HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
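The scanEditLog idea in the comment above can be sketched as follows. This is an illustrative record layout (length, then txid, then opaque payload) and illustrative method names, assumed for the sketch rather than taken from the patch; the point is that a scanner can find the last txid of an in-progress segment by reading only the length and txid of each op and skipping the payload without decoding it.

```java
import java.nio.ByteBuffer;

// Sketch of length-prefixed op scanning: each op is persisted as
// [int length][long txid][payload], where length covers the txid + payload.
public class EditLogScanSketch {

    // Append one op: the 4-byte length covers the 8-byte txid plus the payload.
    static void writeOp(ByteBuffer buf, long txid, byte[] payload) {
        buf.putInt(8 + payload.length);
        buf.putLong(txid);
        buf.put(payload);
    }

    // Scan without decoding payloads; returns the last complete txid, or -1.
    static long scanLastTxId(ByteBuffer buf) {
        long lastTxId = -1;
        while (buf.remaining() >= 12) {            // need at least length + txid
            int length = buf.getInt();
            if (length < 8 || length > buf.remaining()) {
                break;                              // truncated or corrupt tail
            }
            lastTxId = buf.getLong();
            buf.position(buf.position() + (length - 8)); // jump over payload
        }
        return lastTxId;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        writeOp(buf, 1, new byte[] {1, 2, 3});
        writeOp(buf, 2, new byte[] {4});
        writeOp(buf, 3, new byte[] {5, 6});
        buf.flip();
        long last = scanLastTxId(buf);
        if (last != 3) throw new AssertionError("expected 3, got " + last);
        System.out.println("last txid = " + last);
    }
}
```

The payload bytes are never interpreted, which is what lets the JournalNode stay agnostic to the op encoding of a newer NameNodeLayoutVersion.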
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Status: Patch Available (was: Open) JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Status: Open (was: Patch Available) JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Status: Patch Available (was: Open) JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, editsStored In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.003.patch editsStored Fix some unit tests. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, editsStored In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky
[ https://issues.apache.org/jira/browse/HDFS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6062: Attachment: HDFS-6062.000.patch The patch checks the length of the target file to make sure concat has been processed in NN0. TestRetryCacheWithHA#testConcat is flaky Key: HDFS-6062 URL: https://issues.apache.org/jira/browse/HDFS-6062 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-6062.000.patch After adding retry cache metrics check, TestRetryCacheWithHA#testConcat can fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/). The reason is that the test uses dfs.exists(targetPath) to check whether concat has been done in the original active NN. However, since we create the target file in the beginning, the check always returns true. Thus it is possible that the concat is processed in the new active NN for the first time. And in this case the retry cache will not be hit. -- This message was sent by Atlassian JIRA (v6.2#6252)
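The fix described above can be illustrated with a small sketch. Names and the exact condition are assumptions for illustration, not the patch itself: since the target file is pre-created, dfs.exists(targetPath) is always true, so the test instead compares the target's current length against the length it must have once all source bytes were concatenated onto it.

```java
// Length-based "has concat happened yet" check, as a pure function.
public class ConcatDoneCheck {

    // Concat is done once the target holds its own original bytes plus all
    // bytes from the source files.
    static boolean concatDone(long currentTargetLen, long initialTargetLen,
                              long[] srcLens) {
        long expected = initialTargetLen;
        for (long l : srcLens) {
            expected += l;
        }
        return currentTargetLen == expected;
    }

    public static void main(String[] args) {
        long[] srcs = {1024, 1024, 1024};
        // Before concat: only the target's original bytes are present.
        if (concatDone(512, 512, srcs)) throw new AssertionError();
        // After concat: the target has absorbed all source bytes.
        if (!concatDone(512 + 3 * 1024, 512, srcs)) throw new AssertionError();
        System.out.println("length check distinguishes before/after concat");
    }
}
```

Unlike the exists() probe, this predicate is false until the concat has actually been applied on the namenode the test is inspecting.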
[jira] [Updated] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky
[ https://issues.apache.org/jira/browse/HDFS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6062: Affects Version/s: 2.4.0 Status: Patch Available (was: Open) TestRetryCacheWithHA#testConcat is flaky Key: HDFS-6062 URL: https://issues.apache.org/jira/browse/HDFS-6062 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-6062.000.patch After adding retry cache metrics check, TestRetryCacheWithHA#testConcat can fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/). The reason is that the test uses dfs.exists(targetPath) to check whether concat has been done in the original active NN. However, since we create the target file in the beginning, the check always returns true. Thus it is possible that the concat is processed in the new active NN for the first time. And in this case the retry cache will not be hit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky
Jing Zhao created HDFS-6062: --- Summary: TestRetryCacheWithHA#testConcat is flaky Key: HDFS-6062 URL: https://issues.apache.org/jira/browse/HDFS-6062 Project: Hadoop HDFS Issue Type: Bug Reporter: Jing Zhao Assignee: Jing Zhao Priority: Minor Attachments: HDFS-6062.000.patch After adding retry cache metrics check, TestRetryCacheWithHA#testConcat can fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/). The reason is that the test uses dfs.exists(targetPath) to check whether concat has been done in the original active NN. However, since we create the target file in the beginning, the check always returns true. Thus it is possible that the concat is processed in the new active NN for the first time. And in this case the retry cache will not be hit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.004.patch Thanks for the review, Nicholas! Update the patch to address your comments. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, editsStored In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change
[ https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922122#comment-13922122 ] Jing Zhao commented on HDFS-6058: - The failed test should be unrelated. +1 for the 000 patch. Fix TestHDFSCLI failures after HADOOP-8691 change - Key: HDFS-6058 URL: https://issues.apache.org/jira/browse/HDFS-6058 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Vinayakumar B Assignee: Akira AJISAKA Attachments: HDFS-6058.000.patch, HDFS-6058.patch HADOOP-8691 changed the ls command output. TestHDFSCLI needs to be updated after this change. The latest precommit builds are failing because of this. https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.005.patch Update the patch to fix some unit tests and address Nicholas's comments. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: (was: editsStored) JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923479#comment-13923479 ] Jing Zhao commented on HDFS-6038: - the editsStored binary file needs to be updated again. Will do it in the end. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923725#comment-13923725 ] Jing Zhao commented on HDFS-6038: - bq. DataOutputBuffer.writeInt can be simplified as below. Here we cannot call DataOutputBuffer#writeInt. The issue is that DataOutputStream#writeInt will increase the total number of written bytes by 4 (DataOutputStream#written), and the total number of written bytes will later be retrieved by EditsDoubleBuffer#countReadyBytes, and used by QuorumOutputStream#flushAndSync to determine the size of the data flushed to JNs. Since our writeInt(int, int) method is actually modifying some previous data, the total number of bytes written should be unchanged. Directly calling DataOutputBuffer#writeInt there will append extra 4 bytes (0x) for each editlog transaction recorded in JNs. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
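The constraint explained in the comment above can be sketched as follows. The buffer and counter here are illustrative stand-ins (not EditsDoubleBuffer itself): the point is that backfilling a length field must patch bytes that were already counted, so the "written" total that flushAndSync later reads stays unchanged, whereas routing the write through DataOutputStream#writeInt would bump that counter by 4.

```java
// In-place big-endian int patch that does not touch any byte counter.
public class InPlaceWriteInt {

    // Overwrite 4 bytes at pos (big-endian), leaving the logical length alone.
    static void writeInt(byte[] buf, int pos, int value) {
        buf[pos]     = (byte) (value >>> 24);
        buf[pos + 1] = (byte) (value >>> 16);
        buf[pos + 2] = (byte) (value >>> 8);
        buf[pos + 3] = (byte)  value;
    }

    static int readInt(byte[] buf, int pos) {
        return ((buf[pos] & 0xff) << 24) | ((buf[pos + 1] & 0xff) << 16)
             | ((buf[pos + 2] & 0xff) << 8) | (buf[pos + 3] & 0xff);
    }

    public static void main(String[] args) {
        byte[] op = new byte[16];     // 4-byte length placeholder + op body
        int written = op.length;      // the already-counted ready bytes
        writeInt(op, 0, 12);          // backfill the real length afterwards
        if (readInt(op, 0) != 12) throw new AssertionError();
        if (written != 16) throw new AssertionError();  // count unchanged
        System.out.println("length field = " + readInt(op, 0));
    }
}
```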
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.006.patch Update the patch to fix the javadoc warning and unit test failure. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.007.patch Added an extra check for writing length. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, ha, hdfs-client, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception
[ https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924442#comment-13924442 ] Jing Zhao commented on HDFS-6077: - This is actually a similar issue to HDFS-5339: SecurityUtil#buildTokenService tries to resolve the name service id as a host name. Since HDFS-5339 already figures out the token service name for the webhdfs filesystem during initialization, we can simply override the getCanonicalServiceName method and return the tokenServiceName. running slive with webhdfs on secure HA cluster fails with un kown host exception - Key: HDFS-6077 URL: https://issues.apache.org/jira/browse/HDFS-6077 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Arpit Gupta Assignee: Jing Zhao -- This message was sent by Atlassian JIRA (v6.2#6252)
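The shape of the fix can be sketched as follows. The class names, the base-class behavior, and the "ha-hdfs:" prefix are assumptions for illustration, not the actual WebHdfsFileSystem code: the filesystem computes its token service name once during initialization, and the getCanonicalServiceName override returns that cached value instead of letting SecurityUtil#buildTokenService try to resolve the logical nameservice id (e.g. "ha-2-secure") as a hostname.

```java
// Override getCanonicalServiceName to return a precomputed token service
// name, avoiding DNS resolution of a logical HA nameservice id.
public class HaServiceNameSketch {

    static class BaseFileSystem {
        // Default behavior: derive the service name from the URI authority,
        // which fails for a logical nameservice id that is not a real host.
        public String getCanonicalServiceName() {
            throw new IllegalArgumentException("UnknownHostException: ha-2-secure");
        }
    }

    static class HaWebHdfs extends BaseFileSystem {
        private final String tokenServiceName;

        HaWebHdfs(String nameServiceId) {
            // Computed once during initialization in the real filesystem.
            this.tokenServiceName = "ha-hdfs:" + nameServiceId;
        }

        @Override
        public String getCanonicalServiceName() {
            return tokenServiceName;   // no host resolution involved
        }
    }

    public static void main(String[] args) {
        BaseFileSystem fs = new HaWebHdfs("ha-2-secure");
        String name = fs.getCanonicalServiceName();
        if (!name.equals("ha-hdfs:ha-2-secure")) throw new AssertionError(name);
        System.out.println("token service = " + name);
    }
}
```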
[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception
[ https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6077: Status: Patch Available (was: Open) running slive with webhdfs on secure HA cluster fails with un kown host exception - Key: HDFS-6077 URL: https://issues.apache.org/jira/browse/HDFS-6077 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6077.000.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception
[ https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6077: Attachment: HDFS-6077.000.patch A simple patch to fix. running slive with webhdfs on secure HA cluster fails with un kown host exception - Key: HDFS-6077 URL: https://issues.apache.org/jira/browse/HDFS-6077 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6077.000.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924578#comment-13924578 ] Jing Zhao commented on HDFS-6038: - [~tlipcon], do you also want to take a look at the patch? JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Status: Open (was: Patch Available) JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: editsStored Upload the editsStored binary file. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6072) Clean up dead code of FSImage
[ https://issues.apache.org/jira/browse/HDFS-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13926059#comment-13926059 ] Jing Zhao commented on HDFS-6072: - The patch looks good to me. Besides the comments from [~ajisakaa], we may also want to clean up the imports of FSNamesystem.java, and remove AbstractINodeDiff#wirteSnapshot. Clean up dead code of FSImage - Key: HDFS-6072 URL: https://issues.apache.org/jira/browse/HDFS-6072 Project: Hadoop HDFS Issue Type: Improvement Reporter: Haohui Mai Assignee: Haohui Mai Attachments: HDFS-6072.000.patch After HDFS-5698, HDFS stores the FSImage in protobuf format. The old code for saving the FSImage is now dead and should be removed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with unkown host exception
[ https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6077: Summary: running slive with webhdfs on secure HA cluster fails with unkown host exception (was: running slive with webhdfs on secure HA cluster fails with un kown host exception) running slive with webhdfs on secure HA cluster fails with unkown host exception Key: HDFS-6077 URL: https://issues.apache.org/jira/browse/HDFS-6077 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6077.000.patch SliveTest fails with following. See the comment for more logs. {noformat} SliveTest: Unable to run job due to error: java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-2-secure at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377) at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:258) at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:299) ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with unkown host exception
[ https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6077: Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've committed this to trunk, branch-2 and branch-2.4.0. running slive with webhdfs on secure HA cluster fails with unkown host exception Key: HDFS-6077 URL: https://issues.apache.org/jira/browse/HDFS-6077 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Arpit Gupta Assignee: Jing Zhao Fix For: 2.4.0 Attachments: HDFS-6077.000.patch SliveTest fails with following. See the comment for more logs. {noformat} SliveTest: Unable to run job due to error: java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-2-secure at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377) at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:258) at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:299) ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6072) Clean up dead code of FSImage
[ https://issues.apache.org/jira/browse/HDFS-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930973#comment-13930973 ] Jing Zhao commented on HDFS-6072: - +1 for the new patch. Thanks for the cleanup, Haohui! Clean up dead code of FSImage - Key: HDFS-6072 URL: https://issues.apache.org/jira/browse/HDFS-6072 Project: Hadoop HDFS Issue Type: Improvement Reporter: Haohui Mai Assignee: Haohui Mai Attachments: HDFS-6072.000.patch, HDFS-6072.001.patch, HDFS-6072.002.patch After HDFS-5698, HDFS stores the FSImage in protobuf format. The old code for saving the FSImage is now dead and should be removed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931045#comment-13931045 ] Jing Zhao commented on HDFS-6089: - Checked the log with Arpit. The issue looks like this: 1. After NN1 got suspended, NN2 started the transition. It first tried to stop the editlog tailer thread. 2. The editlog tailer thread happened to trigger NN1 to roll its editlog right before the transition, and this rpc call got stuck since NN1 was suspended. 3. It took a relatively long time (1min) for the rollEditlog rpc call to receive the connection reset exception. 4. During this time, NN2 waited for the tailer thread to die, and the fsnamesystem lock was held by the stopStandbyService call. 5. haadmin's getServiceState request could not get a response (since the lock was held by the transition thread in NN2) and timed out (its default socket timeout is 20s). In summary, it is possible that the rollEditlog rpc call from the standby NN to the active NN in the editlog tailer thread may delay the NN failover. Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao The following scenario was tested: * Determine the active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
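One general mitigation pattern for step 3 of the scenario above can be sketched as follows. This is illustrative only, and not the actual fix (the patch on this issue removes the standby-triggered roll entirely): a blocking call to a possibly suspended peer is issued through an executor with a bounded wait, so the caller's thread, and any lock it holds, is not stalled for the full TCP timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Bound the wait on a blocking remote call so a hung peer cannot stall us.
public class BoundedRpcSketch {

    static String callWithTimeout(Callable<String> rpc, long timeoutMs)
            throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = ex.submit(rpc);
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);
                return null;   // give up quickly instead of blocking failover
            }
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a suspended NN: the call never returns within the budget.
        String r = callWithTimeout(() -> {
            Thread.sleep(10_000);
            return "rolled";
        }, 100);
        if (r != null) throw new AssertionError();
        System.out.println("timed out fast; the caller's thread is free again");
    }
}
```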
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931048#comment-13931048 ] Jing Zhao commented on HDFS-6089: - Since in active NN we already have a NameNodeEditLogRoller thread triggering the editlog roll, I guess the standby NN doesn't need to trigger the active namenode to roll the editlog. Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao The following scenario was tested: * Determine Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed that some times the call to get the service state of nn2 got a socket time out exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6089: Attachment: HDFS-6089.000.patch Simple patch to remove the editlog roll from SBN.
[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6089: Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6089: Attachment: HDFS-6089.001.patch Fix unit tests.
[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933577#comment-13933577 ] Jing Zhao commented on HDFS-6038: - Thanks for the review, Todd! bq. just worried that other contributors may want to review this patch as it's actually making an edit log format change, not just a protocol change for the JNs. I will update the jira title and description to make the changes clearer. bq. it might be nice to add a QJM test which writes fake ops to a JournalNode Yeah, will update the patch to add the unit test. JournalNode hardcodes NameNodeLayoutVersion in the edit log file Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored In HA setup, the JNs receive edit logs (blobs) from the NN and write them into edit log files. In order to write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log; therefore it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade.
[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Description: In HA setup, the JNs receive edit logs (blobs) from the NN and write them into edit log files. In order to write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log; therefore it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. Meanwhile, the JN currently tries to decode the in-progress editlog segment in order to learn the last txid in the segment. In the rolling upgrade scenario, a JN running the old software may not be able to correctly decode an editlog generated by the new software. This jira makes the following changes to allow the JN to handle an editlog produced by software with a future layoutversion: 1. Change the NN-JN startLogSegment RPC signature and let the NN specify the layoutversion for the new editlog segment. 2. Persist a length field for each editlog op to indicate the total length of the op. Instead of calling EditLogFileInputStream#validateEditLog to get the last txid of an in-progress editlog segment, a new method scanEditLog is added and used by the JN; it does not decode each editlog op but uses the length to quickly jump to the next op. was: In HA setup, the JNs receive edit logs (blob) from the NN and write into edit log files. In order to write well-formed edit log files, the JNs prepend a header for each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade.
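The length-field idea in item 2 of the description above can be sketched as follows. This is an illustrative record layout, not the actual HDFS editlog format: each op is stored as an 8-byte txid, a 4-byte length, and an opaque body, so a scanner can find the last txid without decoding any op body:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of scanning by a length field (not the real editlog
// format): [8-byte txid][4-byte length][opaque body] per op. The scanner
// never decodes a body, so ops written by a future layout version are fine.
public class LengthPrefixedScan {
    static void writeOp(DataOutputStream out, long txid, byte[] body)
            throws IOException {
        out.writeLong(txid);
        out.writeInt(body.length);
        out.write(body);               // payload may use a future layout version
    }

    // Return the txid of the last fully written op, skipping bodies by length.
    static long scanLastTxid(byte[] segment) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment));
        long last = -1;
        while (in.available() >= 12) { // need at least a txid + length header
            long txid = in.readLong();
            int len = in.readInt();
            if (in.skipBytes(len) < len) {
                break;                 // truncated in-progress tail: ignore partial op
            }
            last = txid;
        }
        return last;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeOp(out, 1, new byte[] {42});
        writeOp(out, 2, new byte[] {1, 2, 3});
        writeOp(out, 3, new byte[0]);
        System.out.println(scanLastTxid(buf.toByteArray())); // prints 3
    }
}
```

The same scan works on a truncated in-progress segment, which is exactly the case the JN hits when it must find the last txid without decoding ops.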
[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Summary: Allow JournalNode to handle editlog produced by new release with future layoutversion (was: JournalNode hardcodes NameNodeLayoutVersion in the edit log file)
[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Attachment: HDFS-6038.008.patch Update the patch to address Todd's comments. The main change is a new unit test in TestJournal: the test writes some editlog ops that the JNs cannot decode and verifies that the JN can use the length field to scan the segment.
[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Status: Patch Available (was: Open)
[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934384#comment-13934384 ] Jing Zhao commented on HDFS-6094: - I can also reproduce the issue on my local machine. The issue looks like this: 1. After the standby NN restarts, DN1 sends first the incremental block report and then the complete block report to the SBN. 2. DN2 sends its incremental block report to the SBN. This block report does not change the replica number in the SBN because the corresponding storage ID has not been added in the SBN yet (the storage ID is only added during full block report processing). However, the SBN still checks the current live replica number (which is 1, because the SBN already received the full block report from DN1) and uses that number to update the safe block count. So maybe a simple fix can be:
{code}
@@ -2277,7 +2277,7 @@ private Block addStoredBlock(final BlockInfo block,
     if (storedBlock.getBlockUCState() == BlockUCState.COMMITTED &&
         numLiveReplicas >= minReplication) {
       storedBlock = completeBlock(bc, storedBlock, false);
-    } else if (storedBlock.isComplete()) {
+    } else if (storedBlock.isComplete() && added) {
       // check whether safe replication is reached for the block
       // only complete blocks are counted towards that
       // Is no-op if not in safe mode.
{code}
The same block can be counted twice towards safe mode threshold --- Key: HDFS-6094 URL: https://issues.apache.org/jira/browse/HDFS-6094 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal {{BlockManager#addStoredBlock}} can cause the same block to be counted twice towards the safe mode threshold. We see this manifest via {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More details to follow in a comment. Exception details:
{code}
Time elapsed: 12.874 sec FAILURE!
java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of live datanodes 3 has reached the minimum number 0. Safe mode will be turned off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
{code}
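The double count comes down to an increment that is not idempotent: the safe-block counter is bumped on every report of a complete block instead of only when a replica is newly recorded. A toy model of the difference the added guard makes (hypothetical names, not BlockManager code):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the off-by-one: counting a complete block on every report
// double counts when the same replica is reported twice (e.g. via both an
// incremental and a full block report); guarding on "replica newly added"
// keeps the safe-block count idempotent. All names are illustrative.
public class SafeBlockCount {
    final Set<String> recordedReplicas = new HashSet<>();
    int unguardedCount = 0;  // increments on every report of a complete block
    int guardedCount = 0;    // increments only when the replica is new

    void processReport(String block, String storage) {
        boolean added = recordedReplicas.add(block + "@" + storage);
        unguardedCount++;            // pre-patch behavior in this model
        if (added) {
            guardedCount++;          // patched behavior: count each replica once
        }
    }

    public static void main(String[] args) {
        SafeBlockCount bm = new SafeBlockCount();
        bm.processReport("blk_1", "dn1-storage");  // incremental block report
        bm.processReport("blk_1", "dn1-storage");  // same replica, full report
        // prints unguarded=2 guarded=1
        System.out.println("unguarded=" + bm.unguardedCount
            + " guarded=" + bm.guardedCount);
    }
}
```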
[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934407#comment-13934407 ] Jing Zhao commented on HDFS-6094: - Another option is to add new storage id even for incremental block report. [~arpitagarwal], what do you think?
[jira] [Updated] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6094: Attachment: TestHASafeMode-output.txt Attach the log of the test that reproduced the failure. I injected an exception for each increment of safe block count.
[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934472#comment-13934472 ] Jing Zhao commented on HDFS-6094: - Maybe another issue with the current code is that when an incremental block report comes before the full block report, if the stored block state is COMMITTED, we may increase the safemode total block number while not increasing the safe block count. In that case I'm not sure if the NN can get stuck in safemode.
[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935296#comment-13935296 ] Jing Zhao commented on HDFS-6094: - The patch looks good to me. One question is, currently NN adds info about a new datanode storage only when processing complete block report. Can we also do this for IBR?
[jira] [Commented] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode
[ https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935411#comment-13935411 ] Jing Zhao commented on HDFS-6100: - The patch looks pretty good to me. Some minor comments: # In DatanodeWebHdfsMethods, the current patch has inconsistent field names for the NamenodeAddressParam parameter (nnId, namenodeId, and namenodeRpcAddress). How about just calling it namenode, since it can be either a NameService ID or a NameNode RPC address? # Nit: the following code needs some reformatting:
{code}
tokenServiceName = HAUtil.isHAEnabled(conf, nsId) ?
    nsId : NetUtils.getHostPortString(rpcServer.getRpcAddress());
{code}
# In the new unit test, we can add some extra checks on the content of the newly created file. Also, maybe we can transition the second NN to active first so that the first create call also hits a failover. # Looks like the patch also fixes the token service name for webhdfs in an HA setup. Please update the description of the jira. # Could you also post your system test results (HA, non-HA, secure, insecure setups, etc.)? DataNodeWebHdfsMethods does not failover in HA mode --- Key: HDFS-6100 URL: https://issues.apache.org/jira/browse/HDFS-6100 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Haohui Mai Attachments: HDFS-6100.000.patch In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to the NN so that it can access the files in the cluster. {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate the NN. Currently the parameter is set by the NN and is a host-ip pair, which does not support HA.
[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
[ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938081#comment-13938081 ] Jing Zhao commented on HDFS-6094: - The latest patch looks good to me. +1.
[jira] [Commented] (HDFS-6090) Use MiniDFSCluster.Builder instead of deprecated constructors
[ https://issues.apache.org/jira/browse/HDFS-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938182#comment-13938182 ] Jing Zhao commented on HDFS-6090: - Thanks for the cleanup, Akira! The patch looks good to me. +1 pending Jenkins. Use MiniDFSCluster.Builder instead of deprecated constructors - Key: HDFS-6090 URL: https://issues.apache.org/jira/browse/HDFS-6090 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 2.3.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Minor Labels: newbie Attachments: HDFS-6090.patch Some test classes are using deprecated constructors such as {{MiniDFSCluster(Configuration, int, boolean, String[], String[])}} for building a MiniDFSCluster. These classes should use {{MiniDFSCluster.Builder}} to reduce javac warnings and improve code readability.
[jira] [Updated] (HDFS-6090) Use MiniDFSCluster.Builder instead of deprecated constructors
[ https://issues.apache.org/jira/browse/HDFS-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6090: Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've committed the patch to trunk, branch-2, and branch-2.4. Thanks [~ajisakaa] for the contribution.
[jira] [Commented] (HDFS-6113) Rolling upgrae exception
[ https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938811#comment-13938811 ] Jing Zhao commented on HDFS-6113: - Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling upgrade... And HA upgrade support also starts only from 2.4. Also, please check the document for the detailed rolling upgrade steps. Rolling upgrae exception Key: HDFS-6113 URL: https://issues.apache.org/jira/browse/HDFS-6113 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Fengdong Yu I have a hadoop-2.3 cluster running without security. Then I built a trunk instance, also without security. NN1 - active NN2 - standby DN1 - datanode DN2 - datanode JN1, JN2, JN3 - Journal and ZK Then on NN2:
{code}
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop zkfc
{code}
then: change the environment variables to the new hadoop (trunk version), then:
{code}
hadoop-daemon.sh start namenode
{code}
NN2 throws an exception:
{code}
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal CTime for one or more JournalNodes.
1 exceptions thrown: 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host Details : local host is: 10-204-8-136/10.204.8.136; destination host is: jn33.com:8485; at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81) at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223) at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383) at org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600) at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360) {code} JN throws Exception: {code} 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: readAndProcess threw exception java.io.IOException: Unable to read authentication method from client 10.204.8.136. 
Count of bytes read: 0 java.io.IOException: Unable to read authentication method at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535) 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server listener on 8485: disconnecting client 10.204.8.136:39063. Number of active connections: 1 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
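For context, the rolling upgrade steps referred to above are covered in the HDFS Rolling Upgrade guide (supported only from 2.4 onward). A rough sketch of the documented flow for an HA cluster; the ordering and comments here are illustrative, not a substitute for the guide:

```
# On the active NN: prepare a rollback fsimage first
hdfs dfsadmin -rollingUpgrade prepare
# Poll until the rollback image is reported as ready
hdfs dfsadmin -rollingUpgrade query
# Shut down the standby NN, upgrade its software, then restart it with:
hdfs namenode -rollingUpgrade started
# Fail over to the upgraded NN, upgrade the other NN the same way, then:
hdfs dfsadmin -rollingUpgrade finalize
```

Note these steps assume both NNs already run a release that supports rolling upgrade; they cannot take a 2.3 cluster to trunk, which is why the scenario above fails.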
[jira] [Comment Edited] (HDFS-6113) Rolling upgrade exception
[ https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938811#comment-13938811 ] Jing Zhao edited comment on HDFS-6113 at 3/18/14 4:49 AM: -- Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling upgrade, and HA upgrade support starts only in 2.4. Also, please check the document for the detailed rolling upgrade steps. was (Author: jingzhao): Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling upgrade... And HA upgrade support also starts only from 2.4. Also, please check the document for rolling upgrade detailed steps. Rolling upgrade exception Key: HDFS-6113 URL: https://issues.apache.org/jira/browse/HDFS-6113 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Fengdong Yu I have a hadoop-2.3 cluster running non-secure. Then I built a trunk instance, also non-secure. NN1 - active NN2 - standby DN1 - datanode DN2 - datanode JN1,JN2,JN3 - Journal and ZK Then on NN2: {code} hadoop-daemon.sh stop namenode hadoop-daemon.sh stop zkfc {code} Then I changed the environment variables to point to the new hadoop (trunk version) and ran: {code} hadoop-daemon.sh start namenode {code} NN2 throws an exception: {code} org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal CTime for one more JournalNodes. 
1 exceptions thrown: 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host Details : local host is: 10-204-8-136/10.204.8.136; destination host is: jn33.com:8485; at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81) at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223) at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383) at org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600) at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360) {code} JN throws Exception: {code} 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: readAndProcess threw exception java.io.IOException: Unable to read authentication method from client 10.204.8.136. 
Count of bytes read: 0 java.io.IOException: Unable to read authentication method at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535) 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server listener on 8485: disconnecting client 10.204.8.136:39063. Number of active connections: 1 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938931#comment-13938931 ] Jing Zhao commented on HDFS-6089: - Thanks for the comments, Andrew and Todd! bq. In EditLogTailer#doTailEdits, I believe that rolling the edit log right before is intended to freshen up the edit log for consumption by the SbNN. But in the current code, the auto trigger still runs periodically, which means we cannot guarantee that we roll the edit log before we call doTailEdits. During failover, we call editLog.recoverUnclosedStreams() and EditLogTailer#catchupDuringFailover in FSNamesystem#startActiveServices to guarantee the SBN can tail all the edit log. But before failover, if we make the autoroller on the active NN more aggressive (as you suggested), we can still guarantee that the SBN will not have to replay much on a failover. What do you think? bq. we'll need to update its check period and thresholds to be more aggressive. Yes, agreed. We should assign a smaller value to the sleep interval (maybe 2min, just like the SBN). bq. Maybe we should just have a shorter timeout on the rollEditLog call. Or somehow.. We can also do this. But having two auto rollers working on the two NNs at the same time still seems unnecessary to me. Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch The following scenario was tested: * Determine the Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. 
What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HDFS-6113) Rolling upgrade exception
[ https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao resolved HDFS-6113. - Resolution: Invalid Based on Kihwal and Nicholas's comments, let's close this jira first. Fengdong, thanks for testing, and please feel free to open new jiras if you think there are other issues. Rolling upgrade exception Key: HDFS-6113 URL: https://issues.apache.org/jira/browse/HDFS-6113 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Fengdong Yu I have a hadoop-2.3 cluster running non-secure. Then I built a trunk instance, also non-secure. NN1 - active NN2 - standby DN1 - datanode DN2 - datanode JN1,JN2,JN3 - Journal and ZK Then on NN2: {code} hadoop-daemon.sh stop namenode hadoop-daemon.sh stop zkfc {code} Then I changed the environment variables to point to the new hadoop (trunk version) and ran: {code} hadoop-daemon.sh start namenode {code} NN2 throws an exception: {code} org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal CTime for one more JournalNodes. 
1 exceptions thrown: 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host Details : local host is: 10-204-8-136/10.204.8.136; destination host is: jn33.com:8485; at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81) at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223) at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383) at org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600) at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360) {code} JN throws Exception: {code} 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: readAndProcess threw exception java.io.IOException: Unable to read authentication method from client 10.204.8.136. 
Count of bytes read: 0 java.io.IOException: Unable to read authentication method at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535) 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server listener on 8485: disconnecting client 10.204.8.136:39063. Number of active connections: 1 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939848#comment-13939848 ] Jing Zhao commented on HDFS-6089: - Thanks for the response, Andrew. bq. If we add a time threshold (like the tailer), we want to avoid the reverse problem: a lot of small segments accumulating in the absence of a standby. Could you please explain how we avoid this issue with the current strategy? For the autoroller in the ANN, I guess it should still decide whether to roll based on the # of edits; however, we should change its sleep interval from 5min to a smaller value (e.g., 2min), which means it will check the # of edits every 2min and roll edits if necessary. Does this address your concern? Or am I missing something here? Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch The following scenario was tested: * Determine the Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940229#comment-13940229 ] Jing Zhao commented on HDFS-6089: - Hi Andrew, thanks for the explanation. I guess I understand your concern now: rolling on the ANN based only on the # of edits may cause issues in some scenarios. This is because, if there are no further operations, the SBN may wait a long time to tail the edits that sit in an in-progress segment. bq. Checkpointing combines the edit log with the fsimage, and we purge unnecessary log segments afterwards. But I'm still a little confused about this part. I fail to see the difference between time-based rolling triggered from the SBN and from the ANN. In the current code, the SBN still triggers rolling through an RPC to the ANN. Also this does not affect checkpointing and purging: when the SBN does a checkpoint, both the SBN and the ANN will purge old edits in their own storage (the SBN does the purging before uploading the checkpoint, and the ANN does it after getting the new fsimage). So I guess a possible solution may be: just letting the ANN roll every 2min. I think this can achieve almost the same effect as the current mechanism, without delaying the failover. Or do you see counterexamples with this change? Back to the RPC-timeout solution. It looks like we do not set a timeout for this NN-to-NN RPC right now (correct me if I'm wrong). Setting a timeout (e.g., 20s, just like the default timeout from client to NN) can of course improve the failover time in our test case, but I still prefer the above solution because it makes the rolling behavior simpler and more predictable (especially since it removes the RPC call from the SBN to the ANN). 
Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch The following scenario was tested: * Determine the Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
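The 5min sleep interval discussed above is configurable. A sketch of making the active NN's autoroller more aggressive, assuming the autoroll keys as they appear in hdfs-default.xml (dfs.namenode.edit.log.autoroll.check.interval.ms defaults to 300000 ms):

```xml
<!-- hdfs-site.xml: check every 2min instead of the default 5min whether the
     active NN should roll its edit log; a roll is triggered once the open
     segment exceeds multiplier x dfs.namenode.checkpoint.txns transactions -->
<property>
  <name>dfs.namenode.edit.log.autoroll.check.interval.ms</name>
  <value>120000</value>
</property>
<property>
  <name>dfs.namenode.edit.log.autoroll.multiplier.threshold</name>
  <value>2.0</value>
</property>
```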
[jira] [Updated] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode
[ https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6100: Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I've committed the patch to trunk, branch-2, and branch-2.4. Thanks [~wheat9] for the contribution. DataNodeWebHdfsMethods does not failover in HA mode --- Key: HDFS-6100 URL: https://issues.apache.org/jira/browse/HDFS-6100 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Haohui Mai Fix For: 2.4.0 Attachments: HDFS-6100.000.patch, HDFS-6100.001.patch In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to the NN, so that it can access the files in the cluster. {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate the NN. This implementation has two problems: # The {{DFSClient}} only knows about the current active NN, thus it does not support failover. # The delegation token is based on the active NN, therefore the {{DFSClient}} will fail to authenticate with the standby NN in a secure HA setup. Currently the parameter {{namenoderpcaddress}} in the URL stores the host-ip pair that corresponds to the active NN. To fix this bug, this jira proposes to store the name service id in the parameter in an HA setup (the parameter stays the same in a non-HA setup). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode
[ https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940716#comment-13940716 ] Jing Zhao commented on HDFS-6100: - +1 for the latest patch. DataNodeWebHdfsMethods does not failover in HA mode --- Key: HDFS-6100 URL: https://issues.apache.org/jira/browse/HDFS-6100 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Haohui Mai Attachments: HDFS-6100.000.patch, HDFS-6100.001.patch In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to the NN, so that it can access the files in the cluster. {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate the NN. This implementation has two problems: # The {{DFSClient}} only knows about the current active NN, thus it does not support failover. # The delegation token is based on the active NN, therefore the {{DFSClient}} will fail to authenticate with the standby NN in a secure HA setup. Currently the parameter {{namenoderpcaddress}} in the URL stores the host-ip pair that corresponds to the active NN. To fix this bug, this jira proposes to store the name service id in the parameter in an HA setup (the parameter stays the same in a non-HA setup). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6105) NN web UI for DN list loads the same jmx page multiple times.
[ https://issues.apache.org/jira/browse/HDFS-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941088#comment-13941088 ] Jing Zhao commented on HDFS-6105: - +1 NN web UI for DN list loads the same jmx page multiple times. - Key: HDFS-6105 URL: https://issues.apache.org/jira/browse/HDFS-6105 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.3.0 Reporter: Kihwal Lee Assignee: Haohui Mai Attachments: HDFS-6105.000.patch, datanodes-tab.png When loading Datanodes page of the NN web UI, the same jmx query is made multiple times. For a big cluster, that's a lot of data and overhead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941117#comment-13941117 ] Jing Zhao commented on HDFS-6089: - bq. This is because if we don't have further operations it is possible that SBN will wait a long time to tail that part of edits which is in an in-progress segment. bq. In this scenario, the ANN will keep rolling every 2mins, generating a lot of edit log segments that aren't being cleared out. Hmm, actually my thought yesterday was not correct. Yes, we cannot do auto rolling simply based on time, and the reason is just as [~andrew.wang] pointed out. Hopefully this is my last question, just to make sure: the current SBN auto roller can cause the same issue (a lot of edit log segments not being cleared out) in the case where checkpointing is broken (but the SBN is not down), right? Anyway, I will post a patch to add an RPC timeout later. Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch The following scenario was tested: * Determine the Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6127) WebHDFS tokens cannot be renewed in HA setup
[ https://issues.apache.org/jira/browse/HDFS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941149#comment-13941149 ] Jing Zhao commented on HDFS-6127: - The patch looks good to me. Some comments: # Nit: HAUtil#getServiceUriFromToken and TokenManager#getInstance need formatting. # The javadoc of HAUtil#getServiceUriFromToken needs to be updated after the change: the method can now support URIs of other filesystems, not just HDFS. # The change in TestWebhdfsForHA actually weakens our unit test, I think. We still need the fs.renew and fs.cancel for regression of HDFS-5339. A separate unit test with some code copy should be fine here, I guess. {code} - fs.renewDelegationToken(token); - fs.cancelDelegationToken(token); + token.renew(conf); + token.cancel(conf); {code} +1 after addressing the comments. WebHDFS tokens cannot be renewed in HA setup Key: HDFS-6127 URL: https://issues.apache.org/jira/browse/HDFS-6127 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Haohui Mai Attachments: HDFS-6127.000.patch The {{TokenAspect}} class assumes that the service name of the token is always a host-ip pair. In an HA setup, however, the service name becomes the name service id, which breaks the assumption. As a result, WebHDFS tokens cannot be renewed in an HA setup. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6127) WebHDFS tokens cannot be renewed in HA setup
[ https://issues.apache.org/jira/browse/HDFS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941171#comment-13941171 ] Jing Zhao commented on HDFS-6127: - bq. The change in TestWebhdfsForHA actually weakens our unit test, I think Actually it will not, since fs.renew and fs.cancel will still be called in the end. Please just ignore this comment. WebHDFS tokens cannot be renewed in HA setup Key: HDFS-6127 URL: https://issues.apache.org/jira/browse/HDFS-6127 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Haohui Mai Attachments: HDFS-6127.000.patch The {{TokenAspect}} class assumes that the service name of the token is always a host-ip pair. In an HA setup, however, the service name becomes the name service id, which breaks the assumption. As a result, WebHDFS tokens cannot be renewed in an HA setup. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
[ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6089: Attachment: HDFS-6089.002.patch Patch that adds an RPC timeout for the rollEditLog call. I set the default timeout to 20s. Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended Key: HDFS-6089 URL: https://issues.apache.org/jira/browse/HDFS-6089 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jing Zhao Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch, HDFS-6089.002.patch The following scenario was tested: * Determine the Active NN and suspend the process (kill -19) * Wait about 60s to let the standby transition to active * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active. What was noticed was that sometimes the call to get the service state of nn2 got a socket timeout exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
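The mechanism behind the 002 patch — bounding the SBN's blocking rollEditLog RPC so a frozen peer cannot stall failover indefinitely — can be sketched in plain Java. This is only an illustration of the timeout pattern with a stand-in Callable; the actual patch configures the timeout on the Hadoop RPC proxy rather than wrapping the call like this:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TimedCall {
    // Run a blocking call but give up after timeoutMs, so the caller
    // (e.g., a failover path) is never stalled by an unresponsive peer.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Future.get throws TimeoutException once timeoutMs elapses.
            return executor.submit(call).get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            executor.shutdownNow(); // interrupt the call if still running
        }
    }
}
```

With a 20s bound, a suspended ANN costs the SBN at most 20s on the rollEditLog path instead of waiting on a connection that never answers.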
[jira] [Commented] (HDFS-6120) Cleanup safe mode error messages
[ https://issues.apache.org/jira/browse/HDFS-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942052#comment-13942052 ] Jing Zhao commented on HDFS-6120: - Thanks for the cleanup, Arpit! The patch looks good to me. Some minor comments: # I guess we should use if (needEnter()) here? {code} + if (!needEnter()) { + reportStatus("STATE* Safe mode ON, thresholds not met.", false); + } {code} # For getTurnOffTip(), do you think we need an extra message to report the value of reached and the corresponding status (safe mode off, on, or in extension)? After the NN enters the safemode extension, the safe block count may still fall below the threshold. Cleanup safe mode error messages Key: HDFS-6120 URL: https://issues.apache.org/jira/browse/HDFS-6120 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.3.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: HDFS-6120.01.patch In HA mode the SBN can enter safe-mode extension and stay there even after the extension period has elapsed, but continue to return the safemode message stating that "The threshold has been reached and safe mode will be turned off soon." -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao moved HADOOP-10415 to HDFS-6131: -- Component/s: (was: documentation) documentation Affects Version/s: (was: 2.3.0) 2.3.0 Key: HDFS-6131 (was: HADOOP-10415) Project: Hadoop HDFS (was: Hadoop Common) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6131: Attachment: HDFS-6131.000.patch Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6131: Status: Open (was: Patch Available) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942128#comment-13942128 ] Jing Zhao commented on HDFS-6131: - Upload the patch for branch-2. Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6131: Status: Patch Available (was: Open) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942212#comment-13942212 ] Jing Zhao commented on HDFS-6131: - Ahh, yes. Thanks Arpit! Let me update the patch. Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6131: Attachment: HDFS-6131.001.patch Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-6131.000.patch, HDFS-6131.001.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
[ https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao resolved HDFS-6131. - Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Thanks very much to Arpit for the review! I've committed this to branch-2 and branch-2.4.0. Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS Key: HDFS-6131 URL: https://issues.apache.org/jira/browse/HDFS-6131 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.3.0 Reporter: Jing Zhao Assignee: Jing Zhao Fix For: 2.4.0 Attachments: HDFS-6131.000.patch, HDFS-6131.001.patch Currently in branch-2, the document HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should move them to HDFS just like in trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942461#comment-13942461 ] Jing Zhao commented on HDFS-6038: - Thanks for the review, Nicholas! I will commit the patch shortly. Allow JournalNode to handle editlog produced by new release with future layoutversion - Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, HDFS-6038.008.patch, editsStored In an HA setup, the JNs receive edit logs (as opaque blobs) from the NN and write them into edit log files. To write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. Meanwhile, the JN currently decodes the in-progress editlog segment in order to determine the last txid in the segment. In a rolling-upgrade scenario, a JN running the old software may not be able to correctly decode an editlog generated by the new software. This jira makes the following changes to allow the JN to handle an editlog produced by software with a future layoutversion:
1. Change the NN--JN startLogSegment RPC signature to let the NN specify the layoutversion for the new editlog segment.
2. Persist a length field for each editlog op to indicate the total length of the op.
3. Instead of calling EditLogFileInputStream#validateEditLog to get the last txid of an in-progress editlog segment, add a new method, scanEditLog, used by the JN, which does not decode each editlog op but uses the length field to jump quickly to the next op.
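The length-prefix scanning idea described above can be sketched in a few lines of Java. This is a simplified, hypothetical record format and the names (EditLogScanSketch, writeOps, scanLastTxid) are invented for illustration; the real HDFS on-disk editlog format and scanEditLog implementation differ. The point is only the technique: with a length field in front of each op, a scanner can read the txid and then skip the rest of the op without decoding its payload, so an old JN can find the last txid of a segment written by newer software.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch only; not the actual HDFS editlog classes or format.
public class EditLogScanSketch {
    // Each op is serialized as: [int length][long txid][length-8 bytes of payload].
    static byte[] writeOps(long[] txids, int payloadSize) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (long txid : txids) {
            out.writeInt(8 + payloadSize);    // total length of the op body
            out.writeLong(txid);
            out.write(new byte[payloadSize]); // opaque payload; scanner never decodes it
        }
        return bos.toByteArray();
    }

    // scanEditLog analogue: use each op's length field to skip to the next op,
    // remembering the last txid seen, without decoding any payload bytes.
    static long scanLastTxid(byte[] segment) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment));
        long lastTxid = -1;
        while (in.available() >= 4) {
            int len = in.readInt();
            lastTxid = in.readLong();
            in.skipBytes(len - 8); // jump straight past the payload to the next op
        }
        return lastTxid;
    }

    public static void main(String[] args) throws IOException {
        byte[] seg = writeOps(new long[]{101, 102, 103}, 16);
        System.out.println(scanLastTxid(seg)); // prints 103
    }
}
```

Because the scanner treats each payload as an opaque run of bytes, new op types or field layouts introduced by a future layoutversion do not break it, which is exactly why the length field makes rolling upgrade safe for the JN.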
[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion
[ https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-6038: Resolution: Fixed Fix Version/s: 2.4.0 Status: Resolved (was: Patch Available) I've committed the patch to trunk, branch-2, and branch-2.4. Thanks Nicholas and Todd for the review! Allow JournalNode to handle editlog produced by new release with future layoutversion - Key: HDFS-6038 URL: https://issues.apache.org/jira/browse/HDFS-6038 Project: Hadoop HDFS Issue Type: Sub-task Components: journal-node, namenode Reporter: Haohui Mai Assignee: Jing Zhao Fix For: 2.4.0 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, HDFS-6038.008.patch, editsStored In an HA setup, the JNs receive edit logs (as opaque blobs) from the NN and write them into edit log files. To write well-formed edit log files, the JNs prepend a header to each edit log file. The problem is that the JN hard-codes the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} during rolling upgrade. Meanwhile, the JN currently decodes the in-progress editlog segment in order to determine the last txid in the segment. In a rolling-upgrade scenario, a JN running the old software may not be able to correctly decode an editlog generated by the new software. This jira makes the following changes to allow the JN to handle an editlog produced by software with a future layoutversion:
1. Change the NN--JN startLogSegment RPC signature to let the NN specify the layoutversion for the new editlog segment.
2. Persist a length field for each editlog op to indicate the total length of the op.
3. Instead of calling EditLogFileInputStream#validateEditLog to get the last txid of an in-progress editlog segment, add a new method, scanEditLog, used by the JN, which does not decode each editlog op but uses the length field to jump quickly to the next op.