[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks
[ https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3270: --- Attachment: HDFS-3270.001.patch * fix malloc check in hdfs.c (not actually a memory leak,b ut still bogus) run valgrind on fuse-dfs, fix any memory leaks -- Key: HDFS-3270 URL: https://issues.apache.org/jira/browse/HDFS-3270 Project: Hadoop HDFS Issue Type: Improvement Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3270.001.patch run valgrind on fuse-dfs, fix any memory leaks -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks
[ https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3270: --- Status: Patch Available (was: Open) run valgrind on fuse-dfs, fix any memory leaks -- Key: HDFS-3270 URL: https://issues.apache.org/jira/browse/HDFS-3270 Project: Hadoop HDFS Issue Type: Improvement Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3270.001.patch run valgrind on fuse-dfs, fix any memory leaks -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks
[ https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3270: --- Attachment: HDFS-3270.002.patch * fix another case of the same mistake run valgrind on fuse-dfs, fix any memory leaks -- Key: HDFS-3270 URL: https://issues.apache.org/jira/browse/HDFS-3270 Project: Hadoop HDFS Issue Type: Improvement Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3270.001.patch, HDFS-3270.002.patch run valgrind on fuse-dfs, fix any memory leaks -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3306) fuse_dfs: don't lock release operations
[ https://issues.apache.org/jira/browse/HDFS-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3306: --- Attachment: HDFS-3306.001.patch fuse_dfs: don't lock release operations --- Key: HDFS-3306 URL: https://issues.apache.org/jira/browse/HDFS-3306 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3306.001.patch There's no need to lock release operations in FUSE, because release can only be called once on a fuse_file_info structure. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3306) fuse_dfs: don't lock release operations
[ https://issues.apache.org/jira/browse/HDFS-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3306: --- Status: Patch Available (was: Open) fuse_dfs: don't lock release operations --- Key: HDFS-3306 URL: https://issues.apache.org/jira/browse/HDFS-3306 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3306.001.patch There's no need to lock release operations in FUSE, because release can only be called once on a fuse_file_info structure. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3290) Use a better local directory layout for the datanode
[ https://issues.apache.org/jira/browse/HDFS-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3290: --- Description: When the HDFS DataNode stores chunks in a local directory, it currently puts all of the chunk files into either one big directory, or a collection of directories. However, there is no way to know which directory a given block will end up in, given its ID. As the number of files increases, this does not scale well. Similar to the git version control system, HDFS should create a few different top level directories keyed off of a few bits in the chunk ID. Git uses 8 bits. This substantially cuts down on the number of chunk files in the same directory and gives increased performance, while not compromising O(1) lookup of chunks. was: When the HDFS DataNode stores chunks in a local directory, it currently puts all of the chunk files into one big directory. As the number of files increases, this does not work well at all. Local filesystems are not optimized for the case where there are hundreds of thousands of files in the same directory. It also makes inspecting directories with standard UNIX tools difficult. Similar to the git version control system, HDFS should create a few different top level directories keyed off of a few bits in the chunk ID. Git uses 8 bits. This substantially cuts down on the number of chunk files in the same directory and gives increased performance. Use a better local directory layout for the datanode Key: HDFS-3290 URL: https://issues.apache.org/jira/browse/HDFS-3290 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 0.23.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor When the HDFS DataNode stores chunks in a local directory, it currently puts all of the chunk files into either one big directory, or a collection of directories. However, there is no way to know which directory a given block will end up in, given its ID. As the number of files increases, this does not scale well. Similar to the git version control system, HDFS should create a few different top level directories keyed off of a few bits in the chunk ID. Git uses 8 bits. This substantially cuts down on the number of chunk files in the same directory and gives increased performance, while not compromising O(1) lookup of chunks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Attachment: HDFS-3206.004.patch * here's a patch with the binary part included. Hopefully jenkins won't choke on it... oev: miscellaneous xml cleanups --- Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch, HDFS-3206.003.patch, HDFS-3206.004.patch * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Attachment: HDFS-3049.003.patch rebase on trunk During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt Key: HDFS-3049 URL: https://issues.apache.org/jira/browse/HDFS-3049 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Affects Versions: 0.23.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, HDFS-3049.003.patch During the NameNode startup process, we load an image, and then apply edit logs to it until we believe that we have all the latest changes. Unfortunately, if there is an I/O error while reading any of these files, in most cases, we simply abort the startup process. We should try harder to locate a readable edit log and/or image file. *There are three main use cases for this feature:* 1. If the operating system does not honor fsync (usually due to a misconfiguration), a file may end up in an inconsistent state. 2. In certain older releases where we did not use fallocate() or similar to pre-reserve blocks, a disk full condition may cause a truncated log in one edit directory. 3. There may be a bug in HDFS which results in some of the data directories receiving corrupt data, but not all. This is the least likely use case. *Proposed changes to normal NN startup* * We should try a different FSImage if we can't load the first one we try. * We should examine other FSEditLogs if we can't load the first one(s) we try. * We should fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. Proposed changes to recovery mode NN startup: we should list out all the available storage directories and allow the operator to select which one he wants to use. Something like this: {code} Multiple storage directories found. 1. /foo/bar edits__curent__XYZ size:213421345 md5:2345345 image size:213421345 md5:2345345 2. /foo/baz edits__curent__XYZ size:213421345 md5:2345345345 image size:213421345 md5:2345345 Which one would you like to use? (1/2) {code} As usual in recovery mode, we want to be flexible about error handling. In this case, this means that we should NOT fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. *Not addressed by this feature* This feature will not address the case where an attempt to access the NameNode name directory or directories hangs because of an I/O error. This may happen, for example, when trying to load an image from a hard-mounted NFS directory, when the NFS server has gone away. Just as now, the operator will have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Attachment: HDFS-3206.003.patch oev: miscellaneous xml cleanups --- Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch, HDFS-3206.003.patch * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Attachment: HDFS-3049.002.patch * fix tests During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt Key: HDFS-3049 URL: https://issues.apache.org/jira/browse/HDFS-3049 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Affects Versions: 0.23.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch During the NameNode startup process, we load an image, and then apply edit logs to it until we believe that we have all the latest changes. Unfortunately, if there is an I/O error while reading any of these files, in most cases, we simply abort the startup process. We should try harder to locate a readable edit log and/or image file. *There are three main use cases for this feature:* 1. If the operating system does not honor fsync (usually due to a misconfiguration), a file may end up in an inconsistent state. 2. In certain older releases where we did not use fallocate() or similar to pre-reserve blocks, a disk full condition may cause a truncated log in one edit directory. 3. There may be a bug in HDFS which results in some of the data directories receiving corrupt data, but not all. This is the least likely use case. *Proposed changes to normal NN startup* * We should try a different FSImage if we can't load the first one we try. * We should examine other FSEditLogs if we can't load the first one(s) we try. * We should fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. Proposed changes to recovery mode NN startup: we should list out all the available storage directories and allow the operator to select which one he wants to use. Something like this: {code} Multiple storage directories found. 1. /foo/bar edits__curent__XYZ size:213421345 md5:2345345 image size:213421345 md5:2345345 2. /foo/baz edits__curent__XYZ size:213421345 md5:2345345345 image size:213421345 md5:2345345 Which one would you like to use? (1/2) {code} As usual in recovery mode, we want to be flexible about error handling. In this case, this means that we should NOT fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. *Not addressed by this feature* This feature will not address the case where an attempt to access the NameNode name directory or directories hangs because of an I/O error. This may happen, for example, when trying to load an image from a hard-mounted NFS directory, when the NFS server has gone away. Just as now, the operator will have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Status: Open (was: Patch Available) harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch, HDFS-3134.004.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Attachment: HDFS-3134.005.patch new patch with the common stuff split out. Requires HADOOP-8275 harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different image or EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Attachment: HDFS-3049.001.patch * implement edit log failover (no image stuff in here) During the normal loading NN startup process, fall back on a different image or EditLog if we see one that is corrupt - Key: HDFS-3049 URL: https://issues.apache.org/jira/browse/HDFS-3049 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 0.24.0 Attachments: HDFS-3049.001.patch During the NameNode startup process, we load an image, and then apply edit logs to it until we believe that we have all the latest changes. Unfortunately, if there is an I/O error while reading any of these files, in most cases, we simply abort the startup process. We should try harder to locate a readable edit log and/or image file. *There are three main use cases for this feature:* 1. If the operating system does not honor fsync (usually due to a misconfiguration), a file may end up in an inconsistent state. 2. In certain older releases where we did not use fallocate() or similar to pre-reserve blocks, a disk full condition may cause a truncated log in one edit directory. 3. There may be a bug in HDFS which results in some of the data directories receiving corrupt data, but not all. This is the least likely use case. *Proposed changes to normal NN startup* * We should try a different FSImage if we can't load the first one we try. * We should examine other FSEditLogs if we can't load the first one(s) we try. * We should fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. Proposed changes to recovery mode NN startup: we should list out all the available storage directories and allow the operator to select which one he wants to use. Something like this: {code} Multiple storage directories found. 1. /foo/bar edits__curent__XYZ size:213421345 md5:2345345 image size:213421345 md5:2345345 2. /foo/baz edits__curent__XYZ size:213421345 md5:2345345345 image size:213421345 md5:2345345 Which one would you like to use? (1/2) {code} As usual in recovery mode, we want to be flexible about error handling. In this case, this means that we should NOT fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. *Not addressed by this feature* This feature will not address the case where an attempt to access the NameNode name directory or directories hangs because of an I/O error. This may happen, for example, when trying to load an image from a hard-mounted NFS directory, when the NFS server has gone away. Just as now, the operator will have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Summary: During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt (was: During the normal loading NN startup process, fall back on a different image or EditLog if we see one that is corrupt) remove references to FSImage (there is now a separate JIRA for that) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt Key: HDFS-3049 URL: https://issues.apache.org/jira/browse/HDFS-3049 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 0.24.0 Attachments: HDFS-3049.001.patch During the NameNode startup process, we load an image, and then apply edit logs to it until we believe that we have all the latest changes. Unfortunately, if there is an I/O error while reading any of these files, in most cases, we simply abort the startup process. We should try harder to locate a readable edit log and/or image file. *There are three main use cases for this feature:* 1. If the operating system does not honor fsync (usually due to a misconfiguration), a file may end up in an inconsistent state. 2. In certain older releases where we did not use fallocate() or similar to pre-reserve blocks, a disk full condition may cause a truncated log in one edit directory. 3. There may be a bug in HDFS which results in some of the data directories receiving corrupt data, but not all. This is the least likely use case. *Proposed changes to normal NN startup* * We should try a different FSImage if we can't load the first one we try. * We should examine other FSEditLogs if we can't load the first one(s) we try. * We should fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. Proposed changes to recovery mode NN startup: we should list out all the available storage directories and allow the operator to select which one he wants to use. Something like this: {code} Multiple storage directories found. 1. /foo/bar edits__curent__XYZ size:213421345 md5:2345345 image size:213421345 md5:2345345 2. /foo/baz edits__curent__XYZ size:213421345 md5:2345345345 image size:213421345 md5:2345345 Which one would you like to use? (1/2) {code} As usual in recovery mode, we want to be flexible about error handling. In this case, this means that we should NOT fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. *Not addressed by this feature* This feature will not address the case where an attempt to access the NameNode name directory or directories hangs because of an I/O error. This may happen, for example, when trying to load an image from a hard-mounted NFS directory, when the NFS server has gone away. Just as now, the operator will have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Affects Version/s: 0.23.0 Fix Version/s: 2.0.0 harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.0.0 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Status: Patch Available (was: Open) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt Key: HDFS-3049 URL: https://issues.apache.org/jira/browse/HDFS-3049 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 0.24.0 Attachments: HDFS-3049.001.patch During the NameNode startup process, we load an image, and then apply edit logs to it until we believe that we have all the latest changes. Unfortunately, if there is an I/O error while reading any of these files, in most cases, we simply abort the startup process. We should try harder to locate a readable edit log and/or image file. *There are three main use cases for this feature:* 1. If the operating system does not honor fsync (usually due to a misconfiguration), a file may end up in an inconsistent state. 2. In certain older releases where we did not use fallocate() or similar to pre-reserve blocks, a disk full condition may cause a truncated log in one edit directory. 3. There may be a bug in HDFS which results in some of the data directories receiving corrupt data, but not all. This is the least likely use case. *Proposed changes to normal NN startup* * We should try a different FSImage if we can't load the first one we try. * We should examine other FSEditLogs if we can't load the first one(s) we try. * We should fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. Proposed changes to recovery mode NN startup: we should list out all the available storage directories and allow the operator to select which one he wants to use. Something like this: {code} Multiple storage directories found. 1. /foo/bar edits__curent__XYZ size:213421345 md5:2345345 image size:213421345 md5:2345345 2. /foo/baz edits__curent__XYZ size:213421345 md5:2345345345 image size:213421345 md5:2345345 Which one would you like to use? (1/2) {code} As usual in recovery mode, we want to be flexible about error handling. In this case, this means that we should NOT fail if we can't find EditLogs that would bring us up to what we believe is the latest transaction ID. *Not addressed by this feature* This feature will not address the case where an attempt to access the NameNode name directory or directories hangs because of an I/O error. This may happen, for example, when trying to load an image from a hard-mounted NFS directory, when the NFS server has gone away. Just as now, the operator will have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Target Version/s: 2.0.0 Fix Version/s: (was: 2.0.0) harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Release Note: This is a new feature. It is documented in hdfs_user_guide.xml. Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.1.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, HDFS-3055-b1.006.patch, HDFS-3055-b1.007.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Release Note: This is a new feature. It is documented in hdfs_user_guide.xml. Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.0.0 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.043.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Attachment: HDFS-3134.004.patch * add unit test * make range check more flexible (adding upper as well as lower bound, make lower bound configurable) * fix bug where we might not decode certain DelegationKey objects because we encoded them with length = -1 (i.e. no key) harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch, HDFS-3134.004.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.007.patch In the docs, refer to -force, not --force Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, HDFS-3055-b1.006.patch, HDFS-3055-b1.007.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3169) TestFsck should test multiple -move operations in a row
[ https://issues.apache.org/jira/browse/HDFS-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3169: --- Status: Patch Available (was: Open) TestFsck should test multiple -move operations in a row --- Key: HDFS-3169 URL: https://issues.apache.org/jira/browse/HDFS-3169 Project: Hadoop HDFS Issue Type: Improvement Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3169.001.patch TestFsck should test multiple -move operations in a row. Overall, it would be nice to have more coverage on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Attachment: HDFS-3134.003.patch rebase on trunk harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, HDFS-3134.003.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3248) bootstrapstanby repeated twice in hdfs namenode usage message
[ https://issues.apache.org/jira/browse/HDFS-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3248: --- Status: Patch Available (was: Open) bootstrapstanby repeated twice in hdfs namenode usage message - Key: HDFS-3248 URL: https://issues.apache.org/jira/browse/HDFS-3248 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3248.002.patch The HDFS usage message repeats bootstrapStandby twice. {code} Usage: java NameNode [-backup] | [-checkpoint] | [-format[-clusterid cid ]] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint] | [-bootstrapStandby] | [-initializeSharedEdits] | [-bootstrapStandby] | [-recover [ -force ] ] {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3248) bootstrapstanby repeated twice in hdfs namenode usage message
[ https://issues.apache.org/jira/browse/HDFS-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3248: --- Attachment: HDFS-3248.002.patch bootstrapstanby repeated twice in hdfs namenode usage message - Key: HDFS-3248 URL: https://issues.apache.org/jira/browse/HDFS-3248 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3248.002.patch The HDFS usage message repeats bootstrapStandby twice. {code} Usage: java NameNode [-backup] | [-checkpoint] | [-format[-clusterid cid ]] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint] | [-bootstrapStandby] | [-initializeSharedEdits] | [-bootstrapStandby] | [-recover [ -force ] ] {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.005.patch * move askOperator to MetaRecoveryContext::editLogLoaderPrompt * remove unecessary toString() call * warn about losing data from your HDFS filesystem rather than losing data from your filesystem Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.006.patch * TestNameNodeRecovery: use StringUtils instead of StringWriter to serialize exception Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, HDFS-3055-b1.006.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: (was: HDFS-3004__namenode_recovery_tool.txt) Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004__namenode_recovery_tool.txt * update design document Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.042.patch * always prompt about possible data loss, even when -force is specified * update hdfs_user_guide.xml so that it talks about -force Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Attachment: HDFS-3134.001.patch * In the edit log loader, don't try to allocate arrays of negative length. Instead, throw an IOException. * When deserializing a variable length number into a java integer, do not ignore problems resulting from truncation-- throw an IOException instead. harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Status: Patch Available (was: Open) harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.004.patch * update patch to reflect comments from HDFS-3004 Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input
[ https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3134: --- Attachment: HDFS-3134.002.patch harden edit log loader against malformed or malicious input --- Key: HDFS-3134 URL: https://issues.apache.org/jira/browse/HDFS-3134 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch Currently, the edit log loader does not handle bad or malicious input sensibly. We can often cause OutOfMemory exceptions, null pointer exceptions, or other unchecked exceptions to be thrown by feeding the edit log loader bad input. In some environments, an out of memory error can cause the JVM process to be terminated. It's clear that we want these exceptions to be thrown as IOException instead of as unchecked exceptions. We also want to avoid out of memory situations. The main task here is to put a sensible upper limit on the lengths of arrays and strings we allocate on command. The other task is to try to avoid creating unchecked exceptions (by dereferencing potentially-NULL pointers, for example). Instead, we should verify ahead of time and give a more sensible error message that reflects the problem with the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.042.patch * reposting so jenkins will test Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: correctly serialize SetOwner operations in which the user is not changed
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Attachment: HDFS-3206.001.patch * fix oev: correctly serialize SetOwner operations in which the user is not changed - Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Description: * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. was:SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. Summary: oev: miscellaneous xml cleanups (was: oev: correctly serialize SetOwner operations in which the user is not changed) oev: miscellaneous xml cleanups --- Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Status: Patch Available (was: Open) oev: miscellaneous xml cleanups --- Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.003.patch * rebase on branch-1 todd: yes, this is up to date, and I've run the following tests: TestCheckpoint, TestEditLog, TestNameNodeRecovery, TestEditLogLoading, TestNameNodeMXBean, TestSaveNamespace, TestSecurityTokenEditLog, TestStorageDirectoryFailure, TestStorageRestore Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, HDFS-3055-b1.003.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.039.patch * rebase on trunk * rename RecoveryContext to MetaRecoveryContext * rename -autoChooseDefault to -noPrompt * EditLogInputException.java: remove pointless whitespace change * some whitespace and punctuation improvements to the recovery prompt text. Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups
[ https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3206: --- Attachment: HDFS-3206.002.patch * put OP_END_LOG_SEGMENT at the end of the edit log, which it would be in a real edit log. oev: miscellaneous xml cleanups --- Key: HDFS-3206 URL: https://issues.apache.org/jira/browse/HDFS-3206 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch * SetOwner operations can change both the user and group which a file or directory belongs to, or just one of those. Currently, in the XML serialization/deserialization code, we don't handle the case where just the group is set, not the user. We should handle this case. * consistently serialize generation stamp as GENSTAMP. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.040.patch * update findbugsExcludeFile.xml to reflect the fact that RecoveryContext is now called MetaRecoveryContext Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.041.patch * implement -force Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.019.patch * fix error handling bug that could lead to open files getting leaked (theoretically) * suppress javadoc warnings resulting from com.sun.* API use rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, HDFS-3050.018.patch, HDFS-3050.019.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.020.patch * update OK_JAVADOC_WARNINGS rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, HDFS-3050.018.patch, HDFS-3050.019.patch, HDFS-3050.020.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.038.patch * rebase Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.002.patch * add unit test * some fixes to NN unclean shutdown (to allow unit test to work) * better error reporting for the branch-1 edit log stuff (print out the offset when we encounter a problem) Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors
[ https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-1378: --- Attachment: HDFS-1378-b1.002.patch * port to branch-1 Edit log replay should track and report file offsets in case of errors -- Key: HDFS-1378 URL: https://issues.apache.org/jira/browse/HDFS-1378 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Colin Patrick McCabe Fix For: 0.23.0 Attachments: HDFS-1378-b1.002.patch, hdfs-1378-branch20.txt, hdfs-1378.0.patch, hdfs-1378.1.patch, hdfs-1378.2.txt Occasionally there are bugs or operational mistakes that result in corrupt edit logs which I end up having to repair by hand. In these cases it would be very handy to have the error message also print out the file offsets of the last several edit log opcodes so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along edit log replay the NN is during startup (handy when a 2NN has died and replay takes a while) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors
[ https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-1378: --- Attachment: HDFS-1378-b1.003.patch * include bug fix from revised patch * backport unit test as well Edit log replay should track and report file offsets in case of errors -- Key: HDFS-1378 URL: https://issues.apache.org/jira/browse/HDFS-1378 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Colin Patrick McCabe Fix For: 0.23.0 Attachments: HDFS-1378-b1.002.patch, HDFS-1378-b1.003.patch, hdfs-1378-branch20.txt, hdfs-1378.0.patch, hdfs-1378.1.patch, hdfs-1378.2.txt Occasionally there are bugs or operational mistakes that result in corrupt edit logs which I end up having to repair by hand. In these cases it would be very handy to have the error message also print out the file offsets of the last several edit log opcodes so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along edit log replay the NN is during startup (handy when a 2NN has died and replay takes a while) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors
[ https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-1378: --- Attachment: HDFS-1378-b1.004.patch * add unit test Edit log replay should track and report file offsets in case of errors -- Key: HDFS-1378 URL: https://issues.apache.org/jira/browse/HDFS-1378 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Colin Patrick McCabe Fix For: 0.23.0 Attachments: HDFS-1378-b1.002.patch, HDFS-1378-b1.003.patch, HDFS-1378-b1.004.patch, hdfs-1378-branch20.txt, hdfs-1378.0.patch, hdfs-1378.1.patch, hdfs-1378.2.txt Occasionally there are bugs or operational mistakes that result in corrupt edit logs which I end up having to repair by hand. In these cases it would be very handy to have the error message also print out the file offsets of the last several edit log opcodes so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along edit log replay the NN is during startup (handy when a 2NN has died and replay takes a while) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.018.patch * rebase on latest trunk rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, HDFS-3050.018.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.017.patch * rebase on trunk rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.037.patch * style fixes Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3181) org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3181: --- Attachment: testOut.txt exception, standard output, etc of a failure org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart fails intermittently Key: HDFS-3181 URL: https://issues.apache.org/jira/browse/HDFS-3181 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Attachments: testOut.txt org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart seems to be failing intermittently on jenkins. {code} org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart Failing for the past 1 build (Since Failed#2163 ) Took 8.4 sec. Error Message Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655) Stacktrace org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655) at org.apache.hadoop.ipc.Client.call(Client.java:1159) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:185) at $Proxy15.getAdditionalDatanode(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at $Proxy15.getAdditionalDatanode(Unknown Source) at
[jira] [Updated] (HDFS-3181) testHardLeaseRecoveryAfterNameNodeRestart fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3181: --- Summary: testHardLeaseRecoveryAfterNameNodeRestart fails intermittently (was: org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart fails intermittently) testHardLeaseRecoveryAfterNameNodeRestart fails intermittently -- Key: HDFS-3181 URL: https://issues.apache.org/jira/browse/HDFS-3181 Project: Hadoop HDFS Issue Type: Bug Reporter: Colin Patrick McCabe Attachments: testOut.txt org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart seems to be failing intermittently on jenkins. {code} org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart Failing for the past 1 build (Since Failed#2163 ) Took 8.4 sec. Error Message Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655) Stacktrace org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655) at org.apache.hadoop.ipc.Client.call(Client.java:1159) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:185) at $Proxy15.getAdditionalDatanode(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at $Proxy15.getAdditionalDatanode(Unknown Source) at
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: HDFS-3044-b1.004.patch * remember to run fsck -delete before checking to see if the file is really deleted (d'oh!) * add test that running fsck -move a few times in a row has no harmful effects fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Fix For: 1.1.0, 2.0.0 Attachments: HDFS-3044-b1.002.patch, HDFS-3044-b1.004.patch, HDFS-3044.002.patch, HDFS-3044.003.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Description: Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. was: Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. Summary: rework OEV to share more code with the NameNode (was: refactor OEV to share more code with the NameNode) rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.016.patch * rebase against current trunk * posted improved patch description and name rework OEV to share more code with the NameNode --- Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch, HDFS-3050.016.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. By using the existing FSEditLogLoader code to load edits in OEV, we can avoid having to update two places when the format changes. We should not put opcode checksums into the XML, because they are a serialization detail, not related to what the data is what we're storing. This will also make it possible to modify the XML file and translate this modified file back to a binary edits log file. Finally, this changes introduces --fix-txids. When OEV is passed this flag, it will close gaps in the transaction log by modifying the sequence numbers. This is useful if you want to modify the edit log XML (say, by removing a transaction), and transform the modified XML back into a valid binary edit log file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.036.patch * rebase on trunk * slight cleanup of EditLogBackupInputStream::nextValidOp() Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.015.patch * set 2-space indentation in output XML to match old XML * add Javadoc comments to XMLUtils refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, HDFS-3050.015.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: HDFS-3050-b1.001.patch * port to branch-1 fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Fix For: 2.0.0 Attachments: HDFS-3044.002.patch, HDFS-3044.003.patch, HDFS-3050-b1.001.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: HDFS-3044-b1.002.patch * fix patch name fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Fix For: 2.0.0 Attachments: HDFS-3044-b1.002.patch, HDFS-3044.002.patch, HDFS-3044.003.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: (was: HDFS-3050-b1.001.patch) fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Fix For: 2.0.0 Attachments: HDFS-3044-b1.002.patch, HDFS-3044.002.patch, HDFS-3044.003.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.035.patch * TestNameNodeRecovery: don't need data nodes for this test * TestNameNodeRecovery: use set rather than array Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.034.patch * address Todd's suggestions Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.014.patch * fix unit test * enable reading edit logs from XML * add -f / -fix-txids option, which makes oev close any holes in the transaction ID series. * Editing the XML no longer allows you to manually recalculate checksums for the edited opcode. refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.012.patch * fix findbugs warnings refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch, HDFS-3050.012.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.033.patch * rebase on trunk * fix some resource leaks in TestNameNodeRecovery * some whitespace and logging cleanups Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.032.patch * hdfs_user_guide.xml: fix paragraph breaks Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1
[ https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3055: --- Attachment: HDFS-3055-b1.001.patch * initial version Implement recovery mode for branch-1 Key: HDFS-3055 URL: https://issues.apache.org/jira/browse/HDFS-3055 Project: Hadoop HDFS Issue Type: New Feature Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.0 Attachments: HDFS-3055-b1.001.patch Implement recovery mode for branch-1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.027.patch * fix bug uncovered by jenkins * add a little more debug in TestNameNodeRecovery Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.009.patch rebase refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.009.patch refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: (was: HDFS-3050.009.patch) refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.010.patch remove all changes to common refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3129) NetworkTopology: add test that getLeaf should check for invalid topologies
[ https://issues.apache.org/jira/browse/HDFS-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3129: --- Attachment: HDFS-3129.001.patch NetworkTopology: add test that getLeaf should check for invalid topologies -- Key: HDFS-3129 URL: https://issues.apache.org/jira/browse/HDFS-3129 Project: Hadoop HDFS Issue Type: Test Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3129.001.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.011.patch * fix broken import lines refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, HDFS-3050.011.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.029.patch * rebase on trunk * fix findbugs suppression Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3129) NetworkTopology: add test that getLeaf should check for invalid topologies
[ https://issues.apache.org/jira/browse/HDFS-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3129: --- Attachment: HDFS-3129-b1.001.patch * branch-1 version NetworkTopology: add test that getLeaf should check for invalid topologies -- Key: HDFS-3129 URL: https://issues.apache.org/jira/browse/HDFS-3129 Project: Hadoop HDFS Issue Type: Test Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3129-b1.001.patch, HDFS-3129.001.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.030.patch fix bug in prompting Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.031.patch testing done: manually corrupted an edit log, recovered it with recovery mode. patch update: whitespace tweak for recovery prompt Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.024.patch * HdfsServerConstants: use equals for string equality * fix bugs with the upgrade process Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.026.patch rebase on latest trunk Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.020.patch Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Status: Open (was: Patch Available) Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Status: Patch Available (was: Open) Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.007.patch rebase on latest trunk refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: (was: HDFS-3050.008.patch) refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode
[ https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3050: --- Attachment: HDFS-3050.008.patch HDFS-3050.008.patch refactor OEV to share more code with the NameNode - Key: HDFS-3050 URL: https://issues.apache.org/jira/browse/HDFS-3050 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, HDFS-3050.008.patch Current, OEV (the offline edits viewer) re-implements all of the opcode parsing logic found in the NameNode. This duplicated code creates a maintenance burden for us. OEV should be refactored to simply use the normal EditLog parsing code, rather than rolling its own. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: HDFS-3044.003.patch address eli's comments fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Attachments: HDFS-3044.002.patch, HDFS-3044.003.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.022.patch * remove some unecessary whitespace changes * re-introduce EditLogInputException * edit log input stream: change API as we discussed. * FSEditLogLoader: re-organize this file. Fix some corner cases relating to out-of-order transaction IDs Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.023.patch * OpInstanceCache needs to be thread-local to work correctly * update exception text regex in TestFSEditLogLoader Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.019.patch * remove obsolete references to JournalStream in comments * rename resync to skipBrokenEdits * rename -f to -chooseFirst * fix EditLogInputStream comments * ELIS::rewind - ELIS::putOp * FSEditLogLoader: fix a case where the numEdits return could be incorrect * FSEditLogLoader: improve handling of missing transactions * fix some cases in which we had been assuming that there are no gaps in the transaction log stream. Try to avoid doing fancy arithmetic by counting the number of edits we've decoded, etc. Instead, just rely on looking at the transaction ID of the last edit we decoded. Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.016.patch * rebase on latest trunk * make nextOp protected in all subclasses of ELIS Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.017.patch add a section about recovery to hdfs_user_guide.xml Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.018.patch * move to using a per-Reader or per-FSEditLog operation cache, rather than a purely per-thread operation cache. Now that we are caching a single opcode inside the FSInputStream, we definitely don't want these instances shared between threads. Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.012.patch * make more exceptions skippable * rename StartupOption.ALWAYS_CHOOSE_YES to StartupOption.ALWAYS_CHOOSE_FIRST, to better reflect what it does. * refactor EditLogInputStream a bit Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.013.patch * remove SkippableEditLogException, as it turned out not to be necessary * test skipping in EditLogInputStream Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3004) Implement Recovery Mode
[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3004: --- Attachment: HDFS-3004.015.patch fix small bug in opcode skipping Implement Recovery Mode --- Key: HDFS-3004 URL: https://issues.apache.org/jira/browse/HDFS-3004 Project: Hadoop HDFS Issue Type: New Feature Components: tools Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004__namenode_recovery_tool.txt When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get in this case. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime. Recovery mode is initialized by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down. Unlike in the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will prompt the operator as to what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or typing 'a' at one of the prompts. I have reused as much code as possible from the NameNode in this tool. Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default
[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3044: --- Attachment: (was: HDFS-3044.001.patch) fsck move should be non-destructive by default -- Key: HDFS-3044 URL: https://issues.apache.org/jira/browse/HDFS-3044 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Eli Collins Assignee: Colin Patrick McCabe Attachments: HDFS-3044.002.patch The fsck move behavior in the code and originally articulated in HADOOP-101 is: {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to fix them would be to recover chains of blocks and put them into lost+found{quote} A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed. I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the block as files at least makes part of the files accessible. However this behavior can also result in permanent dataloss. Eg: - Some datanodes don't come up (eg due to a HW issues) and checkin on cluster startup, files with blocks where all replicas are on these set of datanodes are marked corrupt - Admin does fsck move, which deletes the corrupt files, saves whatever blocks were available - The HW issues with datanodes are resolved, they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files since the file was deleted. I think we should: - Make fsck move non-destructive by default (eg just does a move into lost+found) - Make the destructive behavior optional (eg --destructive so admins think about what they're doing) - Provide better sanity checks and warnings, eg if you're running fsck and not all the slaves have checked in (if using dfs.hosts) then fsck should print a warning indicating this that an admin should have to override if they want to do something destructive -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira