[jira] [Commented] (HDFS-1952) FSEditLog.open() appears to succeed even if all EDITS directories fail
[ https://issues.apache.org/jira/browse/HDFS-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037746#comment-13037746 ] Matt Foley commented on HDFS-1952: -- +1 on the v22 version. Confirmed it compiles successfully, and the new unit test fails before the FSEditLog change and passes after. FSEditLog.open() appears to succeed even if all EDITS directories fail -- Key: HDFS-1952 URL: https://issues.apache.org/jira/browse/HDFS-1952 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.22.0, 0.23.0 Reporter: Matt Foley Assignee: Andrew Wang Labels: newbie Attachments: hdfs-1952-0.22.patch, hdfs-1952.patch, hdfs-1952.patch, hdfs-1952.patch FSEditLog.open() appears to succeed even if all of the individual directories failed to allow creation of an EditLogOutputStream. The problem and solution are essentially similar to those of HDFS-1505. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
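The failure mode described above can be sketched as follows. This is a hypothetical, minimal illustration, not the actual FSEditLog code: the names EditsDir and openStream are invented stand-ins. The point of the HDFS-1952 fix is that open() should throw when no directory yields a usable stream, instead of returning with an empty stream list.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the guard HDFS-1952 adds: tolerate individual
// edits-directory failures, but fail loudly if none of them produced a
// usable output stream.
public class EditLogOpenSketch {
    // Stand-in for an edits directory; openStream() may throw on a bad disk.
    interface EditsDir {
        String openStream() throws IOException;
    }

    static List<String> open(EditsDir... dirs) throws IOException {
        List<String> streams = new ArrayList<>();
        for (EditsDir d : dirs) {
            try {
                streams.add(d.openStream());
            } catch (IOException e) {
                // A single failed directory is tolerated; keep trying the rest.
            }
        }
        if (streams.isEmpty()) {
            // The fix: surface total failure instead of silently succeeding.
            throw new IOException("Unable to open any edits directory");
        }
        return streams;
    }
}
```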
[jira] [Updated] (HDFS-1727) fsck command can display command usage if user passes any illegal argument
[ https://issues.apache.org/jira/browse/HDFS-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-1727: -- Description: If a user passes arguments like ./hadoop fsck -test -files -blocks -racks, fsck takes / as the path and displays information for the whole DFS regarding files, blocks, and racks. But this hides the user's mistake. Instead, we can display the command usage when any invalid argument like the above is passed. If a user passes an illegal optional argument like ./hadoop fsck /test -listcorruptfileblocks instead of ./hadoop fsck /test -list-corruptfileblocks, we can also display the proper command usage. was: If a user passes arguments like ./hadoop fsck -test -files -blocks -racks, fsck takes / as the path and displays information for the whole DFS regarding files, blocks, and racks. But this hides the user's mistake. Instead, we can display the usage when any invalid argument like the above is passed. Assignee: Uma Maheswara Rao G Summary: fsck command can display command usage if user passes any illegal argument (was: fsck command can display usage if user passes any other arguments with '-' ( other than -move, -delete, -files , -openforwrite, -blocks , -locations, -racks).) fsck command can display command usage if user passes any illegal argument -- Key: HDFS-1727 URL: https://issues.apache.org/jira/browse/HDFS-1727 Project: Hadoop HDFS Issue Type: Bug Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G Priority: Minor If a user passes arguments like ./hadoop fsck -test -files -blocks -racks, fsck takes / as the path and displays information for the whole DFS regarding files, blocks, and racks. But this hides the user's mistake. Instead, we can display the command usage when any invalid argument like the above is passed.
If a user passes an illegal optional argument like ./hadoop fsck /test -listcorruptfileblocks instead of ./hadoop fsck /test -list-corruptfileblocks, we can also display the proper command usage. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1981) When namenode goes down while checkpointing and is started again, subsequent checkpointing always fails
When namenode goes down while checkpointing and is started again, subsequent checkpointing always fails -- Key: HDFS-1981 URL: https://issues.apache.org/jira/browse/HDFS-1981 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Environment: Linux Reporter: ramkrishna.s.vasudevan Fix For: 0.23.0 This scenario is applicable in both the NN and BNN case. When the namenode goes down after creating edits.new, on subsequent restart the divertFileStreams will not happen to edits.new, because the edits.new file is already present and its size is zero. So, on trying to saveCheckPoint, an exception occurs: 2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: Namenode has an edit log with timestamp of 2011-05-23 16:38:56 but new checkpoint was created using editlog with timestamp 2011-05-23 16:37:30. Checkpoint Aborted. Is this a bug, or is it the expected behaviour? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1978) All but first option in LIBHDFS_OPTS is ignored
[ https://issues.apache.org/jira/browse/HDFS-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037912#comment-13037912 ] Hudson commented on HDFS-1978: -- Integrated in Hadoop-Hdfs-trunk #675 (See [https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/675/]) HDFS-1978. All but first option in LIBHDFS_OPTS is ignored. Contributed by Eli Collins eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1126312 Files : * /hadoop/hdfs/trunk/src/c++/libhdfs/hdfsJniHelper.c * /hadoop/hdfs/trunk/CHANGES.txt All but first option in LIBHDFS_OPTS is ignored --- Key: HDFS-1978 URL: https://issues.apache.org/jira/browse/HDFS-1978 Project: Hadoop HDFS Issue Type: Bug Components: libhdfs Affects Versions: 0.21.0 Environment: RHEL 5.5 JDK 1.6.0_24 Reporter: Brock Noland Assignee: Eli Collins Fix For: 0.22.0 Attachments: HDFS-1978.0.patch, hdfs-1978-1.patch In getJNIEnv, we go through LIBHDFS_OPTS with strtok and count the number of args, then create an array of options based on that information. But when we actually set up the options, we only use the first arg: the loop never advances to the next strtok token. I believe the fix is pasted inline.
{noformat}
Index: src/c++/libhdfs/hdfsJniHelper.c
===
--- src/c++/libhdfs/hdfsJniHelper.c (revision 1124544)
+++ src/c++/libhdfs/hdfsJniHelper.c (working copy)
@@ -442,6 +442,7 @@
     int argNum = 1;
     for (; argNum < noArgs; argNum++) {
         options[argNum].optionString = result; //optHadoopArg;
+        result = strtok( NULL, jvmArgDelims);
     }
 }
{noformat}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1982) Null pointer exception is thrown when NN restarts with a block lesser in size than the block that is present in DN1 but the generation stamp is greater in the NN
Null pointer exception is thrown when NN restarts with a block lesser in size than the block that is present in DN1 but the generation stamp is greater in the NN -- Key: HDFS-1982 URL: https://issues.apache.org/jira/browse/HDFS-1982 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20-append Environment: Linux Reporter: ramkrishna.s.vasudevan Fix For: 0.20-append Consider the following scenario. We have a cluster with one NN and 2 DNs. We write some file. One of the blocks is written in DN1 but not yet completed on DN2's local disk. Now DN1 gets killed, so pipeline recovery happens for the block with the size as in DN2, but the generation stamp gets updated in the NN. DN2 also gets killed. Now restart the NN and DN1. When the NN restarts, the block in the NN has a greater generation stamp but a lesser size than the one reported by DN1. This leads to a NullPointerException in the addStoredBlock API. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1981) When namenode goes down while checkpointing and is started again, subsequent checkpointing always fails
[ https://issues.apache.org/jira/browse/HDFS-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037949#comment-13037949 ] Todd Lipcon commented on HDFS-1981: --- Hi Ramkrishna. Can you provide a unit test which shows this issue? It would be especially good to see such a test against 0.22, since HDFS-1073 will restructure all this code when it's merged into 0.23. When namenode goes down while checkpointing and is started again, subsequent checkpointing always fails -- Key: HDFS-1981 URL: https://issues.apache.org/jira/browse/HDFS-1981 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Environment: Linux Reporter: ramkrishna.s.vasudevan Fix For: 0.23.0 This scenario is applicable in both the NN and BNN case. When the namenode goes down after creating edits.new, on subsequent restart the divertFileStreams will not happen to edits.new, because the edits.new file is already present and its size is zero. So, on trying to saveCheckPoint, an exception occurs: 2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: Namenode has an edit log with timestamp of 2011-05-23 16:38:56 but new checkpoint was created using editlog with timestamp 2011-05-23 16:37:30. Checkpoint Aborted. Is this a bug, or is it the expected behaviour? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.
[ https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HDFS-1950: - Attachment: HDFS-1950-2.patch Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly. Key: HDFS-1950 URL: https://issues.apache.org/jira/browse/HDFS-1950 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: ramkrishna.s.vasudevan Fix For: 0.20-append Attachments: HDFS-1950-2.patch Before going to the root cause, let's see the read behaviour for a file having more than 10 blocks in the append case. Logic: There is a prefetch size dfs.read.prefetch.size for the DFSInputStream, which has a default value of 10. This prefetch size is the number of blocks that the client will fetch from the namenode for reading a file. For example, let's assume that a file X having 22 blocks is residing in HDFS. The reader first fetches the first 10 blocks from the namenode and starts reading. After the above step, the reader fetches the next 10 blocks from the NN and continues reading. Then the reader fetches the remaining 2 blocks from the NN and completes the read. Cause: === Let's see the cause for this issue now. The scenario that will fail is: the writer wrote 10+ blocks and a partial block, then called sync. A reader trying to read the file will not get the last partial block. The client first gets the 10 block locations from the NN. Then it checks whether the file is under construction, and if so it gets the size of the last partial block from the datanode and reads the full file. However, when the number of blocks is more than 10, the last block will not be in the first fetch. It will be in the second or a later fetch (the last block will be in the (num of blocks / 10)th fetch). The problem now is that in DFSClient there is no logic to get the size of the last partial block (as in the single-fetch case) for the fetches other than the first, so the reader will not be able to read the complete synced data. Also, the InputStream.available API uses the first fetched block's size to iterate; ideally this size has to be increased. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
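The fetch arithmetic described above can be illustrated with a small, hypothetical sketch (the method name is invented, not part of DFSClient): with a prefetch size of 10, the last block of a file only appears in the ((numBlocks - 1) / prefetchSize)-th fetch, counting from 0, so logic that inspects only the first fetch misses it once a file passes 10 blocks.

```java
// Hypothetical illustration of which fetch contains a file's last block,
// given the dfs.read.prefetch.size batching described in HDFS-1950.
public class PrefetchMath {
    // Returns the 0-based index of the fetch that contains the last block.
    static int fetchContainingLastBlock(int numBlocks, int prefetchSize) {
        return (numBlocks - 1) / prefetchSize;
    }
}
```

For the 22-block file X in the example, the last block lives in the third fetch (index 2), which is why only fixing the first fetch is not enough.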
[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..
[ https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HDFS-1949: - Attachment: hdfs-1949.patch Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value.. - Key: HDFS-1949 URL: https://issues.apache.org/jira/browse/HDFS-1949 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20-append, 0.21.0, 0.23.0 Reporter: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.23.0 Attachments: HDFS-1949.patch, hdfs-1949.patch In the Namenode UI we have a text box to enter the chunk size. The expected value for the chunk size is a valid integer. If any invalid value, such as a string or empty spaces, is provided, it throws a NumberFormatException. The expected behaviour is that the default value should be used if no valid value is specified. Solution: we can handle the NumberFormatException and assign the default value if an invalid value is specified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
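A minimal sketch of the fix HDFS-1949 proposes might look like the following. The method name and the default value here are illustrative assumptions, not the actual patch: blank or non-numeric input falls back to a default instead of propagating a NumberFormatException to the UI.

```java
// Hypothetical sketch of defaulting on invalid chunk-size input, in the
// spirit of the HDFS-1949 solution. DEFAULT_CHUNK_SIZE is an invented value.
public class ChunkSizeParser {
    static final int DEFAULT_CHUNK_SIZE = 32768;

    static int parseChunkSize(String input) {
        if (input == null || input.trim().isEmpty()) {
            return DEFAULT_CHUNK_SIZE; // blank field: use the default
        }
        try {
            return Integer.parseInt(input.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_CHUNK_SIZE; // string or other invalid value
        }
    }
}
```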
[jira] [Updated] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.
[ https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HDFS-1950: - Status: Patch Available (was: Open) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly. Key: HDFS-1950 URL: https://issues.apache.org/jira/browse/HDFS-1950 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: ramkrishna.s.vasudevan Fix For: 0.20-append Attachments: HDFS-1950-2.patch Before going to the root cause, let's see the read behaviour for a file having more than 10 blocks in the append case. Logic: There is a prefetch size dfs.read.prefetch.size for the DFSInputStream, which has a default value of 10. This prefetch size is the number of blocks that the client will fetch from the namenode for reading a file. For example, let's assume that a file X having 22 blocks is residing in HDFS. The reader first fetches the first 10 blocks from the namenode and starts reading. After the above step, the reader fetches the next 10 blocks from the NN and continues reading. Then the reader fetches the remaining 2 blocks from the NN and completes the read. Cause: === Let's see the cause for this issue now. The scenario that will fail is: the writer wrote 10+ blocks and a partial block, then called sync. A reader trying to read the file will not get the last partial block. The client first gets the 10 block locations from the NN. Then it checks whether the file is under construction, and if so it gets the size of the last partial block from the datanode and reads the full file. However, when the number of blocks is more than 10, the last block will not be in the first fetch. It will be in the second or a later fetch (the last block will be in the (num of blocks / 10)th fetch). The problem now is that in DFSClient there is no logic to get the size of the last partial block (as in the single-fetch case) for the fetches other than the first, so the reader will not be able to read the complete synced data. Also, the InputStream.available API uses the first fetched block's size to iterate; ideally this size has to be increased. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..
[ https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HDFS-1949: - Status: Patch Available (was: Open) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value.. - Key: HDFS-1949 URL: https://issues.apache.org/jira/browse/HDFS-1949 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.21.0, 0.20-append, 0.23.0 Reporter: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.23.0 Attachments: HDFS-1949.patch, hdfs-1949.patch In the Namenode UI we have a text box to enter the chunk size. The expected value for the chunk size is a valid integer. If any invalid value, such as a string or empty spaces, is provided, it throws a NumberFormatException. The expected behaviour is that the default value should be used if no valid value is specified. Solution: we can handle the NumberFormatException and assign the default value if an invalid value is specified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..
[ https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HDFS-1949: - Status: Open (was: Patch Available) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value.. - Key: HDFS-1949 URL: https://issues.apache.org/jira/browse/HDFS-1949 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.21.0, 0.20-append, 0.23.0 Reporter: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.23.0 Attachments: HDFS-1949.patch, hdfs-1949.patch In the Namenode UI we have a text box to enter the chunk size. The expected value for the chunk size is a valid integer. If any invalid value, such as a string or empty spaces, is provided, it throws a NumberFormatException. The expected behaviour is that the default value should be used if no valid value is specified. Solution: we can handle the NumberFormatException and assign the default value if an invalid value is specified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.
[ https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037981#comment-13037981 ] Hadoop QA commented on HDFS-1950: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480115/HDFS-1950-2.patch against trunk revision 1126312. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/612//console This message is automatically generated. Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly. Key: HDFS-1950 URL: https://issues.apache.org/jira/browse/HDFS-1950 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: ramkrishna.s.vasudevan Fix For: 0.20-append Attachments: HDFS-1950-2.patch Before going to the root cause, let's see the read behaviour for a file having more than 10 blocks in the append case. Logic: There is a prefetch size dfs.read.prefetch.size for the DFSInputStream, which has a default value of 10. This prefetch size is the number of blocks that the client will fetch from the namenode for reading a file. For example, let's assume that a file X having 22 blocks is residing in HDFS. The reader first fetches the first 10 blocks from the namenode and starts reading. After the above step, the reader fetches the next 10 blocks from the NN and continues reading. Then the reader fetches the remaining 2 blocks from the NN and completes the read. Cause: === Let's see the cause for this issue now. The scenario that will fail is: the writer wrote 10+ blocks and a partial block, then called sync. A reader trying to read the file will not get the last partial block. The client first gets the 10 block locations from the NN. Then it checks whether the file is under construction, and if so it gets the size of the last partial block from the datanode and reads the full file. However, when the number of blocks is more than 10, the last block will not be in the first fetch. It will be in the second or a later fetch (the last block will be in the (num of blocks / 10)th fetch). The problem now is that in DFSClient there is no logic to get the size of the last partial block (as in the single-fetch case) for the fetches other than the first, so the reader will not be able to read the complete synced data. Also, the InputStream.available API uses the first fetched block's size to iterate; ideally this size has to be increased. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.
[ https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037985#comment-13037985 ] ramkrishna.s.vasudevan commented on HDFS-1950: -- This patch applies to the 0.20-append branch. Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly. Key: HDFS-1950 URL: https://issues.apache.org/jira/browse/HDFS-1950 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: ramkrishna.s.vasudevan Fix For: 0.20-append Attachments: HDFS-1950-2.patch Before going to the root cause, let's see the read behaviour for a file having more than 10 blocks in the append case. Logic: There is a prefetch size dfs.read.prefetch.size for the DFSInputStream, which has a default value of 10. This prefetch size is the number of blocks that the client will fetch from the namenode for reading a file. For example, let's assume that a file X having 22 blocks is residing in HDFS. The reader first fetches the first 10 blocks from the namenode and starts reading. After the above step, the reader fetches the next 10 blocks from the NN and continues reading. Then the reader fetches the remaining 2 blocks from the NN and completes the read. Cause: === Let's see the cause for this issue now. The scenario that will fail is: the writer wrote 10+ blocks and a partial block, then called sync. A reader trying to read the file will not get the last partial block. The client first gets the 10 block locations from the NN. Then it checks whether the file is under construction, and if so it gets the size of the last partial block from the datanode and reads the full file. However, when the number of blocks is more than 10, the last block will not be in the first fetch. It will be in the second or a later fetch (the last block will be in the (num of blocks / 10)th fetch). The problem now is that in DFSClient there is no logic to get the size of the last partial block (as in the single-fetch case) for the fetches other than the first, so the reader will not be able to read the complete synced data. Also, the InputStream.available API uses the first fetched block's size to iterate; ideally this size has to be increased. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1787) Not enough xcievers error should propagate to client
[ https://issues.apache.org/jira/browse/HDFS-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037996#comment-13037996 ] Jonathan Hsieh commented on HDFS-1787: -- Actually, I assumed that the test focused on the patch in the 'changes' section of the jenkins result of build 608. This actually ran the newly added test case from the HDFS-1787 patch. The org.apache.hadoop.hdfs.server.datanode.TestFiDataTransferProtocol2.pipeline_Fi_30 test seems to be intermittently failing. It also isn't reported by hudson. Is there a reason why? Not enough xcievers error should propagate to client -- Key: HDFS-1787 URL: https://issues.apache.org/jira/browse/HDFS-1787 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Jonathan Hsieh Labels: newbie Fix For: 0.23.0 Attachments: hdfs-1787.2.patch, hdfs-1787.3.patch, hdfs-1787.3.patch, hdfs-1787.patch We find that users often run into the default transceiver limits in the DN. Putting aside the inherent issues with xceiver threads, it would be nice if the xceiver-limit-exceeded error propagated to the client. Currently, clients simply see an EOFException which is hard to interpret, and have to go slogging through DN logs to find the underlying issue. The data transfer protocol should be extended to either have a special error code for not enough xceivers, or have some error code for generic errors to which a string can be attached and propagated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..
[ https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038005#comment-13038005 ] Hadoop QA commented on HDFS-1949: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480116/hdfs-1949.patch against trunk revision 1126312. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. -1 release audit. The applied patch generated 1 release audit warning (more than the trunk's current 0 warnings). -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestDFSStorageStateRecovery +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//testReport/ Release audit warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//console This message is automatically generated. Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..
- Key: HDFS-1949 URL: https://issues.apache.org/jira/browse/HDFS-1949 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20-append, 0.21.0, 0.23.0 Reporter: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.23.0 Attachments: HDFS-1949.patch, hdfs-1949.patch In the Namenode UI we have a text box to enter the chunk size. The expected value for the chunk size is a valid integer. If any invalid value, such as a string or empty spaces, is provided, it throws a NumberFormatException. The expected behaviour is that the default value should be used if no valid value is specified. Solution: we can handle the NumberFormatException and assign the default value if an invalid value is specified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections
[ https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038055#comment-13038055 ] Tsz Wo (Nicholas), SZE commented on HDFS-1965: -- A question came up: by setting maxidletime to 0, is there a race condition where the timeout occurs before the first call, i.e. the proxy is closed before the first call? IPCs done using block token-based tickets can't reuse connections - Key: HDFS-1965 URL: https://issues.apache.org/jira/browse/HDFS-1965 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.22.0 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, hdfs-1965.txt This is the reason that TestFileConcurrentReaders has been failing a lot. Reproducing a comment from HDFS-1057: The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket. When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HDFS-1828) TestBlocksWithNotEnoughRacks intermittently fails assert
[ https://issues.apache.org/jira/browse/HDFS-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE resolved HDFS-1828. -- Resolution: Fixed Assignee: Matt Foley Since we are not reverting the patch, re-closing this. If the test is still failing, please create a new issue. TestBlocksWithNotEnoughRacks intermittently fails assert Key: HDFS-1828 URL: https://issues.apache.org/jira/browse/HDFS-1828 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: 0.23.0 Reporter: Matt Foley Assignee: Matt Foley Fix For: 0.23.0 Attachments: TestBlocksWithNotEnoughRacks.java.patch, TestBlocksWithNotEnoughRacks_v2.patch In server.namenode.TestBlocksWithNotEnoughRacks.testSufficientlyReplicatedBlocksWithNotEnoughRacks the assert fails at curReplicas == REPLICATION_FACTOR, but it seems that the count should go higher initially, and if the test doesn't wait for it to go back down, it will fail with a false positive. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections
[ https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038072#comment-13038072 ] Todd Lipcon commented on HDFS-1965: --- I think in trunk, it's not possible, since the connection is only lazily opened by the actual RPC to the DataNode. Then, it won't close since there's a call outstanding. In 0.22, it's possible that it will open one connection for the getProtocolVersion() call and a second one for the actual RPC. Unless I'm missing something, that should only be an efficiency issue and not a correctness issue. Do you agree? IPCs done using block token-based tickets can't reuse connections - Key: HDFS-1965 URL: https://issues.apache.org/jira/browse/HDFS-1965 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.22.0 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, hdfs-1965.txt This is the reason that TestFileConcurrentReaders has been failing a lot. Reproducing a comment from HDFS-1057: The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket. When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections
[ https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038076#comment-13038076 ] Tsz Wo (Nicholas), SZE commented on HDFS-1965: -- Okay, I'm fine with it since it is only a temporary fix. +1, the 0.22 patch looks good. IPCs done using block token-based tickets can't reuse connections - Key: HDFS-1965 URL: https://issues.apache.org/jira/browse/HDFS-1965 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.22.0 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, hdfs-1965.txt This is the reason that TestFileConcurrentReaders has been failing a lot. Reproducing a comment from HDFS-1057: The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket. When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods
[ https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley updated HDFS-1853: - Issue Type: Improvement (was: Sub-task) Parent: (was: HDFS-1852) refactor TestNodeCount to import standard node counting and wait for replication methods -- Key: HDFS-1853 URL: https://issues.apache.org/jira/browse/HDFS-1853 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 0.22.0 Reporter: Matt Foley Eli's suggestions for refactoring the three wait for loops in TestNodeCount for re-usability (similar to what was done for HDFS-1562): You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods
[ https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038098#comment-13038098 ] Matt Foley commented on HDFS-1853: -- Removed from HDFS-1852 umbrella task, since not related to recurring Hudson test failures. refactor TestNodeCount to import standard node counting and wait for replication methods -- Key: HDFS-1853 URL: https://issues.apache.org/jira/browse/HDFS-1853 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 0.22.0 Reporter: Matt Foley Eli's suggestions for refactoring the three wait for loops in TestNodeCount for re-usability (similar to what was done for HDFS-1562): You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods
[ https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley reassigned HDFS-1853: Assignee: Matt Foley refactor TestNodeCount to import standard node counting and wait for replication methods -- Key: HDFS-1853 URL: https://issues.apache.org/jira/browse/HDFS-1853 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 0.22.0 Reporter: Matt Foley Assignee: Matt Foley Eli's suggestions for refactoring the three wait for loops in TestNodeCount for re-usability (similar to what was done for HDFS-1562): You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
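Eli's suggestion above amounts to replacing TestNodeCount's three hand-rolled wait loops with one shared polling helper. A minimal sketch of what such a DFSTestUtil-style method could look like (the name waitForReplicas and the Supplier-based signature are illustrative, not the actual API):

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hedged sketch of a reusable wait-for-replication helper: poll a
// replica-count supplier until it reaches the expected value, or fail
// with a TimeoutException carrying the last observed count.
class ReplicaWait {
    static void waitForReplicas(Supplier<Integer> count, int expected,
                                long timeoutMs, long pollMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (count.get() != expected) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException("replicas=" + count.get()
                        + ", expected " + expected);
            }
            Thread.sleep(pollMs);
        }
    }
}
```

A test would then pass a lambda that queries the namenode (e.g. via NameNodeAdapter) for live or excess replica counts, so the timeout and polling policy live in one place instead of being copied into each replication test.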
[jira] [Commented] (HDFS-1401) TestFileConcurrentReader test case is still timing out / failing
[ https://issues.apache.org/jira/browse/HDFS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038112#comment-13038112 ] Matt Foley commented on HDFS-1401: -- As of build 611 (May 23), we see: TestFileConcurrentReader.testUnfinishedBlockCRCErrorTransferToVerySmallWrite failed almost every build through 604, but has passed the last five builds in which auto-test ran. This may be fixed, but still needs to be watched for intermittent failure. TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite and TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransfer failed intermittently through build 600, but have not failed since. However, their failures are infrequent, skipping six or eight builds between occurrences. They remain on the watch list. TestFileConcurrentReader test case is still timing out / failing Key: HDFS-1401 URL: https://issues.apache.org/jira/browse/HDFS-1401 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs client Affects Versions: 0.22.0 Reporter: Tanping Wang Priority: Critical Attachments: HDFS-1401.patch The unit test case, TestFileConcurrentReader after its most recent fix in HDFS-1310 still times out when using java 1.6.0_07. When using java 1.6.0_07, the test case simply hangs. On apache Hudson build ( which possibly is using a higher sub-version of java) this test case has presented an inconsistent test result that it sometimes passes, some times fails. For example, between the most recent build 423, 424 and build 425, there is no effective change, however, the test case failed on build 424 and passed on build 425 build 424 test failed https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/424/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/ build 425 test passed https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/425/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/ -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1401) TestFileConcurrentReader test case is still timing out / failing
[ https://issues.apache.org/jira/browse/HDFS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038119#comment-13038119 ] sam rash commented on HDFS-1401: see todd's find in: https://issues.apache.org/jira/browse/HDFS-1057 TestFileConcurrentReader test case is still timing out / failing Key: HDFS-1401 URL: https://issues.apache.org/jira/browse/HDFS-1401 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs client Affects Versions: 0.22.0 Reporter: Tanping Wang Priority: Critical Attachments: HDFS-1401.patch The unit test case, TestFileConcurrentReader after its most recent fix in HDFS-1310 still times out when using java 1.6.0_07. When using java 1.6.0_07, the test case simply hangs. On apache Hudson build ( which possibly is using a higher sub-version of java) this test case has presented an inconsistent test result that it sometimes passes, some times fails. For example, between the most recent build 423, 424 and build 425, there is no effective change, however, the test case failed on build 424 and passed on build 425 build 424 test failed https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/424/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/ build 425 test passed https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/425/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1852) Umbrella task: Clean up HDFS unit test recurring failures
[ https://issues.apache.org/jira/browse/HDFS-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038135#comment-13038135 ] Matt Foley commented on HDFS-1852: -- Besides the three remaining open issues above, we also have three infrequent-intermittent issues that may still exist. All were last seen in build 594, so it is possible they were addressed. org.apache.hadoop.cli.TestHDFSCLI.testAll - v intermittent, last 566, 579, 587, 594 org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery.testErrorReplicas - v intermittent, last 559, 566, 579, 594 org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement.testBlockReplacement - v intermittent, last 565, 578, 594 Recording here for watchlist purposes. Umbrella task: Clean up HDFS unit test recurring failures -- Key: HDFS-1852 URL: https://issues.apache.org/jira/browse/HDFS-1852 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.22.0 Reporter: Matt Foley Recurring failures and false positives undermine CI by encouraging developers to ignore unit test failures. Let's clean these up! Some are intermittent due to timing-sensitive conditions. The unit tests for background thread activities (such as block replication and corrupt replica detection) often use wait while or wait until loops to detect results. The quality and robustness of these loops vary widely, and common usages should be moved to DFSTestUtil. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-236) Random read benchmark for DFS
[ https://issues.apache.org/jira/browse/HDFS-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Thompson updated HDFS-236: --- Attachment: RndRead-TestDFSIO.patch I've taken Raghu's patch from 6/27/09 with random read TestDFSIO enhancement, and ported it to the latest (now mapreduce) trunk 5/4/11 svn rev 1099590. Patch attached RndRead-TestDFSIO.patch. enjoy, Dave Random read benchmark for DFS - Key: HDFS-236 URL: https://issues.apache.org/jira/browse/HDFS-236 Project: Hadoop HDFS Issue Type: New Feature Reporter: Raghu Angadi Assignee: Raghu Angadi Attachments: HDFS-236.patch, RndRead-TestDFSIO.patch We should have at least one random read benchmark that can be run with rest of Hadoop benchmarks regularly. Please provide benchmark ideas or requirements. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1983) Fix path display for copy rm
Fix path display for copy rm -- Key: HDFS-1983 URL: https://issues.apache.org/jira/browse/HDFS-1983 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.23.0 Reporter: Daryn Sharp Assignee: Daryn Sharp This will also fix a few misc broken tests. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-988) saveNamespace can corrupt edits log
[ https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-988: - Status: Open (was: Patch Available) removing patch available status since this still needs to be finished up. saveNamespace can corrupt edits log --- Key: HDFS-988 URL: https://issues.apache.org/jira/browse/HDFS-988 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.21.0, 0.20-append, 0.22.0 Reporter: dhruba borthakur Assignee: Todd Lipcon Priority: Blocker Fix For: 0.20-append, 0.22.0 Attachments: HDFS-988_fix_synchs.patch, hdfs-988-2.patch, hdfs-988.txt, saveNamespace.txt, saveNamespace_20-append.patch The administrator puts the namenode in safemode and then issues the savenamespace command. This can corrupt the edits log. The problem is that when the NN enters safemode, there could still be pending logSyncs occurring from other threads. Now, the saveNamespace command, when executed, would save an edits log with partial writes. I have seen this happen on 0.20. https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections
[ https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1965: -- Resolution: Fixed Status: Resolved (was: Patch Available) Committed the 22 patch. Thanks, Nicholas. HADOOP-7317 tracks the real underlying issue. IPCs done using block token-based tickets can't reuse connections - Key: HDFS-1965 URL: https://issues.apache.org/jira/browse/HDFS-1965 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.22.0 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, hdfs-1965.txt This is the reason that TestFileConcurrentReaders has been failing a lot. Reproducing a comment from HDFS-1057: The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket. When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-988) saveNamespace can corrupt edits log, apparently due to race conditions
[ https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley updated HDFS-988: Summary: saveNamespace can corrupt edits log, apparently due to race conditions (was: saveNamespace can corrupt edits log) saveNamespace can corrupt edits log, apparently due to race conditions -- Key: HDFS-988 URL: https://issues.apache.org/jira/browse/HDFS-988 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: dhruba borthakur Assignee: Todd Lipcon Priority: Blocker Fix For: 0.20-append, 0.22.0 Attachments: HDFS-988_fix_synchs.patch, hdfs-988-2.patch, hdfs-988.txt, saveNamespace.txt, saveNamespace_20-append.patch The administrator puts the namenode in safemode and then issues the savenamespace command. This can corrupt the edits log. The problem is that when the NN enters safemode, there could still be pending logSyncs occurring from other threads. Now, the saveNamespace command, when executed, would save an edits log with partial writes. I have seen this happen on 0.20. https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1603) Namenode gets sticky if one of namenode storage volumes disappears (removed, unmounted, etc.)
[ https://issues.apache.org/jira/browse/HDFS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038245#comment-13038245 ] Todd Lipcon commented on HDFS-1603: --- ATM and I just brainstormed about this a little bit over some iced coffee. Though on the surface it doesn't look too hard to implement timeouts on namedir operations, it would actually have to be done in a lot of places (eg mkdirs/move calls on storage directories, writing edits, saving images, etc). Timing out some of these things isn't entirely straightforward, since the underlying calls aren't interruptible. At some point we could attempt to tackle it, but it looks like a complicated project. So, rather than trying to implement this in software for now, it's probably better to just recommend the proper NFS mount options when storing name dirs on NFS. Namenode gets sticky if one of namenode storage volumes disappears (removed, unmounted, etc.) - Key: HDFS-1603 URL: https://issues.apache.org/jira/browse/HDFS-1603 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.21.0 Reporter: Konstantin Boudnik While investigating failures on HDFS-1602 it became apparent that once a namenode storage volume is pulled out NN becomes completely sticky until {{FSImage:processIOError: removing storage}} moves the storage from the active set. During this time none of the normal NN operations are possible (e.g. creating a directory on HDFS eventually times out). In the case of NFS this can be worked around with soft,intr,timeo,retrans settings. However, a better handling of the situation is apparently possible and needs to be implemented. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
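For reference, the mount options named in the issue description translate to an fstab entry along these lines (server name, export path, and timeout values here are purely illustrative, not a recommendation for any particular filer):

```shell
# Example /etc/fstab entry for an NFS-hosted name dir: "soft" makes I/O
# return an error instead of retrying forever, "intr" allows the blocked
# thread to be interrupted, and timeo (in tenths of a second) together
# with retrans bounds how long a dead server can stall the NameNode.
nfs-filer:/export/namedir  /mnt/namedir  nfs  soft,intr,timeo=30,retrans=3  0 0
```

Note that soft mounts trade hang-avoidance for the possibility of I/O errors during transient outages, which is presumably why the comment frames this as an operational recommendation rather than a software fix.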
[jira] [Updated] (HDFS-1967) TestHDFSTrash failing on trunk and 22
[ https://issues.apache.org/jira/browse/HDFS-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley updated HDFS-1967: - Issue Type: Sub-task (was: Bug) Parent: HDFS-1852 TestHDFSTrash failing on trunk and 22 - Key: HDFS-1967 URL: https://issues.apache.org/jira/browse/HDFS-1967 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 0.22.0 Reporter: Todd Lipcon Fix For: 0.22.0 Seems to have started failing recently in many commit builds as well as the last two nightly builds of 22: https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/51/testReport/org.apache.hadoop.hdfs/TestHDFSTrash/testTrashEmptier/ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1967) TestHDFSTrash failing on trunk and 22
[ https://issues.apache.org/jira/browse/HDFS-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038276#comment-13038276 ] Matt Foley commented on HDFS-1967: -- TestHDFSTrash.testTrashEmptier() was failing on almost every Hudson build through 605. However, it has not failed for the last four auto-test builds. Watch-listing for trunk. However, we'd like to understand what fixed it (if it is fixed) so we can apply the patch to v22 and yahoo-merge branches. TestHDFSTrash failing on trunk and 22 - Key: HDFS-1967 URL: https://issues.apache.org/jira/browse/HDFS-1967 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 0.22.0 Reporter: Todd Lipcon Fix For: 0.22.0 Seems to have started failing recently in many commit builds as well as the last two nightly builds of 22: https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/51/testReport/org.apache.hadoop.hdfs/TestHDFSTrash/testTrashEmptier/ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously
[ https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1984: -- Component/s: name-node Description: One of the motivations of HDFS-1073 is that it decouples the checkpoint process so that multiple checkpoints could be taken at the same time and not interfere with each other. Currently on the 1073 branch this doesn't quite work right, since we have some state and validation in FSImage that's tied to a single fsimage_N -- thus if two 2NNs perform a checkpoint at different transaction IDs, only one will succeed. As a stress test, we can run two 2NNs each configured with the fs.checkpoint.interval set to 0 which causes them to continuously checkpoint as fast as they can. Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) HDFS-1073: Enable multiple checkpointers to run simultaneously -- Key: HDFS-1984 URL: https://issues.apache.org/jira/browse/HDFS-1984 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) One of the motivations of HDFS-1073 is that it decouples the checkpoint process so that multiple checkpoints could be taken at the same time and not interfere with each other. Currently on the 1073 branch this doesn't quite work right, since we have some state and validation in FSImage that's tied to a single fsimage_N -- thus if two 2NNs perform a checkpoint at different transaction IDs, only one will succeed. As a stress test, we can run two 2NNs each configured with the fs.checkpoint.interval set to 0 which causes them to continuously checkpoint as fast as they can. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously
[ https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038288#comment-13038288 ] Todd Lipcon commented on HDFS-1984: --- Currently this test scenario fails after a few seconds with an exception like: 11/05/23 15:25:46 WARN mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: Namenode has an edit log corresponding to txid 1240 but new checkpoint was created using editlog ending at txid 1238. Checkpoint Aborted. at org.apache.hadoop.hdfs.server.namenode.FSImage.validateCheckpointUpload(FSImage.java:894) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:107) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:80) but this validation is bogus. So long as no two checkpointers try to upload a checkpoint at the same txid, it's OK if they upload old fsimages. To fix this, I think we need to do the following: a) Repurpose the checkpointTxId field of FSImage. This currently tracks the last txid at which the NN has either saved or uploaded a checkpoint. We use it to advertise which image file a checkpointer should download, but we also use it to validate the checkpoint upload. Instead, it should be renamed to mostRecentImageTxId and only be used to advertise the image. b) Remove the imageDigest field. The function of validation is now being done by an adjacent .md5 file next to each image. When the checkpointer downloads an image, the image transfer servlet can just read the .md5 file and include the hash as an HTTP header. The checkpointer can then verify that it transferred correctly by comparing the image it downloaded against that md5 hash. When uploading the new checkpoint back to the NN, the same process is used in reverse. 
The new validation rules for accepting a checkpoint upload should be: - the namespace/clusterid/etc match up (same as today) - the transaction ID of the uploaded image is less than the current transaction ID of the namespace (sanity check) - the hash of the file received matches the hash that the 2NN communicates in a header HDFS-1073: Enable multiple checkpointers to run simultaneously -- Key: HDFS-1984 URL: https://issues.apache.org/jira/browse/HDFS-1984 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) One of the motivations of HDFS-1073 is that it decouples the checkpoint process so that multiple checkpoints could be taken at the same time and not interfere with each other. Currently on the 1073 branch this doesn't quite work right, since we have some state and validation in FSImage that's tied to a single fsimage_N -- thus if two 2NNs perform a checkpoint at different transaction IDs, only one will succeed. As a stress test, we can run two 2NNs each configured with the fs.checkpoint.interval set to 0 which causes them to continuously checkpoint as fast as they can. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
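The receiving side of the proposed md5 check can be sketched as follows (class and method names are illustrative, not the branch's actual TransferFsImage code): the server reads the image's adjacent .md5 file and sends the hash in an HTTP header, and the downloader re-hashes the bytes it received and rejects the transfer on mismatch.

```java
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hedged sketch of the proposed transfer-integrity check. The advertised
// hash would come from an HTTP header populated from the .md5 file that
// sits next to each fsimage on disk.
class ImageTransferCheck {
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Throws if the transferred bytes do not match the advertised hash.
    static void verifyDownload(byte[] received, String advertisedMd5)
            throws IOException, NoSuchAlgorithmException {
        String actual = md5Hex(received);
        if (!actual.equals(advertisedMd5)) {
            throw new IOException("Image transfer corrupt: expected "
                    + advertisedMd5 + " but got " + actual);
        }
    }
}
```

The same check runs in reverse when the 2NN uploads the new checkpoint back to the NN, so a corrupted transfer is caught on either leg.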
[jira] [Updated] (HDFS-1963) HDFS rpm integration project
[ https://issues.apache.org/jira/browse/HDFS-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDFS-1963: Attachment: HDFS-1963-3.patch Change configuration directory from $PREFIX/conf to $PREFIX/etc/hadoop per Owen's recommendation. For RPM/deb, it will use /etc/hadoop as default, and create a symlink for $PREFIX/etc/hadoop pointing to /etc/hadoop. HDFS rpm integration project Key: HDFS-1963 URL: https://issues.apache.org/jira/browse/HDFS-1963 Project: Hadoop HDFS Issue Type: New Feature Components: build Environment: Java 6, RHEL 5.5 Reporter: Eric Yang Assignee: Eric Yang Attachments: HDFS-1963-1.patch, HDFS-1963-2.patch, HDFS-1963-3.patch, HDFS-1963.patch This jira corresponds to HADOOP-6255 and the associated directory layout change. The patch for creating HDFS rpm packaging should be posted here for the patch test build to verify against hdfs svn trunk. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
[ https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1985: -- Component/s: name-node Description: The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation. Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
[ https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1985: -- Attachment: hdfs-1985.txt Attached patch does the above refactoring and cleanup. While I was at it, I also made the client code check the HTTP response for 200 OK status. This fixes the client error reporting behavior in the event that the server throws an exception while processing the request. HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1985.txt The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
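The response-code check Todd mentions is conceptually simple; a hedged sketch of the client-side guard (illustrative, not the actual patch code) is: fail the transfer with a useful message whenever the image servlet returns anything but 200 OK, instead of silently reading an error page as image data.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical sketch: after connecting to the getimage servlet, check
// the HTTP status before consuming the body, so a server-side exception
// surfaces as a clear client-side error.
class ResponseCheck {
    static void checkResponse(int responseCode, String responseMessage)
            throws IOException {
        if (responseCode != HttpURLConnection.HTTP_OK) {
            throw new IOException("Image transfer servlet responded "
                    + responseCode + ": " + responseMessage);
        }
    }
}
```

In the real flow the two arguments would come from HttpURLConnection.getResponseCode() and getResponseMessage() on the open connection.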
[jira] [Created] (HDFS-1986) Add an option for user to return http or https ports regardless of security is on/off in DFSUtil.getInfoServer()
Add an option for user to return http or https ports regardless of security is on/off in DFSUtil.getInfoServer() Key: HDFS-1986 URL: https://issues.apache.org/jira/browse/HDFS-1986 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 0.23.0 Reporter: Tanping Wang Assignee: Tanping Wang Priority: Minor Fix For: 0.23.0 Currently DFSUtil.getInfoServer gets http port with security off and httpS port with security on. However, we want to return http port regardless of security on/off for Cluster UI to use. Add in a third Boolean parameter for user to decide whether to check security or not. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HDFS-422) fuse-dfs leaks FileSystem handles as it never disconnects them because the FileSystem.Cache does not do reference counting
[ https://issues.apache.org/jira/browse/HDFS-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins resolved HDFS-422. -- Resolution: Duplicate fuse-dfs leaks FileSystem handles as it never disconnects them because the FileSystem.Cache does not do reference counting -- Key: HDFS-422 URL: https://issues.apache.org/jira/browse/HDFS-422 Project: Hadoop HDFS Issue Type: Bug Components: contrib/fuse-dfs Reporter: Pete Wyckoff Priority: Minor since users may be doing multiple file operations at the same time, a single task in fuse, can never call close() on a filesystem (ie libhdfs::hdfsDisconnect) because there may be another thread for the same user. as such, either fuse-dfs needs to do reference counting or FileSystem.Cache needs to or maybe enable a mode where one can turn off the Cache?? I currently am not seeing any problems in production, but I am still running 0.18 version which keeps only one connection as root. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1986) Add an option for user to return http or https ports regardless of whether security is on/off in DFSUtil.getInfoServer()
[ https://issues.apache.org/jira/browse/HDFS-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanping Wang updated HDFS-1986: --- Attachment: HDFS-1986.patch Add an option for user to return http or https ports regardless of whether security is on/off in DFSUtil.getInfoServer() Key: HDFS-1986 URL: https://issues.apache.org/jira/browse/HDFS-1986 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 0.23.0 Reporter: Tanping Wang Assignee: Tanping Wang Priority: Minor Fix For: 0.23.0 Attachments: HDFS-1986.patch Currently DFSUtil.getInfoServer returns the http port with security off and the https port with security on. However, we want to return the http port regardless of whether security is on or off, for the Cluster UI to use. Add a third Boolean parameter that lets the caller decide whether to check security or not.
[jira] [Commented] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
[ https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038311#comment-13038311 ] Eli Collins commented on HDFS-1985: --- +1 pending Hudson. This is much nicer. Nit: indent the throws in parseLongParam and downloadImageToStorage HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1985.txt The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation.
[jira] [Commented] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
[ https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038341#comment-13038341 ] Todd Lipcon commented on HDFS-1985: --- Will fix indentation nits on commit. This is on the 1073 branch, so Hudson won't run against it. I ran TestCheckpoint which covers this code pretty well, as well as a subset of tests (all those modified in the last 4 commits on the branch). They all passed (except for BN-related ones, which are known to be broken at the moment). HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1985.txt The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation.
[jira] [Resolved] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet
[ https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-1985. --- Resolution: Fixed Hadoop Flags: [Reviewed] HDFS-1073: Cleanup in image transfer servlet Key: HDFS-1985 URL: https://issues.apache.org/jira/browse/HDFS-1985 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1985.txt The TransferFsImage class has grown several heads and is somewhat confusing to follow. This JIRA is to refactor it a little bit. - the TransferFsImage class contains static methods to put/get image and edits files. It's used by checkpointing nodes. [the same static methods it has today] - some common code from call sites of TransferFsImage are moved into TransferFsImage itself, so it presents a cleaner interface to checkpointers - the non-static parts of TransferFsImage are moved to an inner class of GetImageServlet called GetImageParams, since they were only responsible for parameter parsing/validation.
[jira] [Commented] (HDFS-1969) Running rollback on new-version namenode destroys namespace
[ https://issues.apache.org/jira/browse/HDFS-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038345#comment-13038345 ] Konstantin Shvachko commented on HDFS-1969: --- Todd, what I meant is that {{NNStorage.setFields()}} should not contain the if statement {code} if (layoutVersion <= -26) { ... } {code} I actually don't see where it is triggered. In general, we can allow these ifs in the loading part of the code, like loadFSImage(). But the saving part should be free of dependencies on the layout version, because there is only one LV - the current one. The precondition sounds good. But it would be better to just convert it to an assert. I don't think we've used {{com.google.*}} before, at least not in HDFS. Why introduce it now? Running rollback on new-version namenode destroys namespace --- Key: HDFS-1969 URL: https://issues.apache.org/jira/browse/HDFS-1969 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 0.22.0 Attachments: hdfs-1969.txt, hdfs-1969.txt The following sequence leaves the namespace in an inconsistent/broken state: - format NN using 0.20 (or any prior release, probably) - run hdfs namenode -upgrade on 0.22. ^C the NN once it comes up. - run hdfs namenode -rollback on 0.22 (this should fail but doesn't!) This leaves the name directory in a state such that the version file claims it's an 0.20 namespace, but the fsimage is in 0.22 format. It then crashes when trying to start up.
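For context on the quoted check: HDFS layout versions are negative integers that decrease as the on-disk format evolves, so "at least as new as LV -26" reads as in this hedged sketch (illustrative only, not the actual NNStorage code):

```java
// Sketch only: HDFS layout versions are negative and decrease over time,
// so a layout "at least as new as" a required version compares with <=.
public class LayoutVersionCheck {
    static boolean atLeastVersion(int layoutVersion, int required) {
        return layoutVersion <= required;
    }
}
```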
[jira] [Created] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running
HDFS-1073: Test for 2NN downloading image is not running Key: HDFS-1987 URL: https://issues.apache.org/jira/browse/HDFS-1987 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon
[jira] [Updated] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running
[ https://issues.apache.org/jira/browse/HDFS-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1987: -- Component/s: name-node Description: TestCheckpoint.testSecondaryImageDownload was introduced at some point but was never called from anywhere, so it wasn't actually running. This JIRA is to fix it up to work on trunk and actually run as part of the test suite. Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) HDFS-1073: Test for 2NN downloading image is not running Key: HDFS-1987 URL: https://issues.apache.org/jira/browse/HDFS-1987 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) TestCheckpoint.testSecondaryImageDownload was introduced at some point but was never called from anywhere, so it wasn't actually running. This JIRA is to fix it up to work on trunk and actually run as part of the test suite.
[jira] [Commented] (HDFS-1969) Running rollback on new-version namenode destroys namespace
[ https://issues.apache.org/jira/browse/HDFS-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038351#comment-13038351 ] Todd Lipcon commented on HDFS-1969: --- bq. I actually don't see where it is triggered. The test code uses this code in order to create VERSION files that look like they came from older versions. We could copy-paste some new code in to generate VERSION files, but that's a bit messy too. bq. The precondition sounds good. But it would be better to just convert it to assert. I don't think we've used com.google.* before at least not in HDFS. Why introduce it now. There was a vote on the mailing list a few months back and people said they were OK with including Guava (com.google.*). The advantage of Preconditions over assert is that Preconditions will always run regardless of JVM options. In areas that aren't performance-sensitive, this is preferred. Running rollback on new-version namenode destroys namespace --- Key: HDFS-1969 URL: https://issues.apache.org/jira/browse/HDFS-1969 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 0.22.0 Attachments: hdfs-1969.txt, hdfs-1969.txt The following sequence leaves the namespace in an inconsistent/broken state: - format NN using 0.20 (or any prior release, probably) - run hdfs namenode -upgrade on 0.22. ^C the NN once it comes up. - run hdfs namenode -rollback on 0.22 (this should fail but doesn't!) This leaves the name directory in a state such that the version file claims it's an 0.20 namespace, but the fsimage is in 0.22 format. It then crashes when trying to start up.
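The distinction Todd describes can be shown with a minimal stand-in for Guava's Preconditions.checkState (a sketch, not Guava itself): the explicit check always runs, whereas `assert` is a no-op unless the JVM was started with -ea.

```java
// Minimal stand-in for Guava's Preconditions.checkState; illustrative only.
public class Checks {
    static void checkState(boolean expression, String message) {
        // Runs unconditionally, unlike `assert`, which the JVM skips
        // entirely unless assertions are enabled with -ea.
        if (!expression) {
            throw new IllegalStateException(message);
        }
    }
}
```

This is why a precondition is preferred for correctness checks on non-hot paths: it cannot be silently disabled by the runtime configuration.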
[jira] [Updated] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running
[ https://issues.apache.org/jira/browse/HDFS-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1987: -- Attachment: hdfs-1987.txt Attached patch makes this test case actually run, and fixes it up to work with the new edits log layout. HDFS-1073: Test for 2NN downloading image is not running Key: HDFS-1987 URL: https://issues.apache.org/jira/browse/HDFS-1987 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1987.txt TestCheckpoint.testSecondaryImageDownload was introduced at some point but was never called from anywhere, so it wasn't actually running. This JIRA is to fix it up to work on trunk and actually run as part of the test suite.
[jira] [Updated] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously
[ https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1984: -- Attachment: hdfs-1984.txt Here's a patch that does the above, and also adds two new test cases: 1) simulates a corrupt byte while transferring the image, making sure it correctly detects it and rejects the upload 2) runs two 2NNs interleaved using Mockito to be sure that they don't interfere with each other I also ran the test from the command line as described above. I was able to run two 2NNs both checkpointing as fast as they could. There was one minor unrelated race condition that I'll address as a followup. HDFS-1073: Enable multiple checkpointers to run simultaneously -- Key: HDFS-1984 URL: https://issues.apache.org/jira/browse/HDFS-1984 Project: Hadoop HDFS Issue Type: Sub-task Components: name-node Affects Versions: Edit log branch (HDFS-1073) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Edit log branch (HDFS-1073) Attachments: hdfs-1984.txt One of the motivations of HDFS-1073 is that it decouples the checkpoint process so that multiple checkpoints could be taken at the same time and not interfere with each other. Currently on the 1073 branch this doesn't quite work right, since we have some state and validation in FSImage that's tied to a single fsimage_N -- thus if two 2NNs perform a checkpoint at different transaction IDs, only one will succeed. As a stress test, we can run two 2NNs each configured with the fs.checkpoint.interval set to 0 which causes them to continuously checkpoint as fast as they can.
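The stress setup described above would amount to a configuration entry like the following in each 2NN's site configuration. The property name is taken verbatim from the issue text; the exact key may differ by release, so treat this as a sketch:

```xml
<!-- Hypothetical stress-test setting from the HDFS-1984 description:
     checkpoint continuously, with no delay between checkpoints. -->
<property>
  <name>fs.checkpoint.interval</name>
  <value>0</value>
</property>
```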