[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks

2012-04-19 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3270:
---

Attachment: HDFS-3270.001.patch

* fix malloc check in hdfs.c

(not actually a memory leak, but still bogus)

 run valgrind on fuse-dfs, fix any memory leaks
 --

 Key: HDFS-3270
 URL: https://issues.apache.org/jira/browse/HDFS-3270
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3270.001.patch


 run valgrind on fuse-dfs, fix any memory leaks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks

2012-04-19 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3270:
---

Status: Patch Available  (was: Open)

 run valgrind on fuse-dfs, fix any memory leaks
 --

 Key: HDFS-3270
 URL: https://issues.apache.org/jira/browse/HDFS-3270
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3270.001.patch


 run valgrind on fuse-dfs, fix any memory leaks





[jira] [Updated] (HDFS-3270) run valgrind on fuse-dfs, fix any memory leaks

2012-04-19 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3270:
---

Attachment: HDFS-3270.002.patch

* fix another case of the same mistake

 run valgrind on fuse-dfs, fix any memory leaks
 --

 Key: HDFS-3270
 URL: https://issues.apache.org/jira/browse/HDFS-3270
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3270.001.patch, HDFS-3270.002.patch


 run valgrind on fuse-dfs, fix any memory leaks





[jira] [Updated] (HDFS-3306) fuse_dfs: don't lock release operations

2012-04-19 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3306:
---

Attachment: HDFS-3306.001.patch

 fuse_dfs: don't lock release operations
 ---

 Key: HDFS-3306
 URL: https://issues.apache.org/jira/browse/HDFS-3306
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3306.001.patch


 There's no need to lock release operations in FUSE, because release can only 
 be called once on a fuse_file_info structure.





[jira] [Updated] (HDFS-3306) fuse_dfs: don't lock release operations

2012-04-19 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3306:
---

Status: Patch Available  (was: Open)

 fuse_dfs: don't lock release operations
 ---

 Key: HDFS-3306
 URL: https://issues.apache.org/jira/browse/HDFS-3306
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3306.001.patch


 There's no need to lock release operations in FUSE, because release can only 
 be called once on a fuse_file_info structure.





[jira] [Updated] (HDFS-3290) Use a better local directory layout for the datanode

2012-04-18 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3290:
---

Description: 
When the HDFS DataNode stores chunks in a local directory, it currently puts 
all of the chunk files into either one big directory, or a collection of 
directories.  However, there is no way to know which directory a given block 
will end up in, given its ID.  As the number of files increases, this does not 
scale well.

Similar to the git version control system, HDFS should create a few different 
top level directories keyed off of a few bits in the chunk ID.  Git uses 8 
bits.  This substantially cuts down on the number of chunk files in the same 
directory and gives increased performance, while not compromising O(1) lookup 
of chunks.

  was:
When the HDFS DataNode stores chunks in a local directory, it currently puts 
all of the chunk files into one big directory.  As the number of files 
increases, this does not work well at all.  Local filesystems are not optimized 
for the case where there are hundreds of thousands of files in the same 
directory.  It also makes inspecting directories with standard UNIX tools 
difficult.

Similar to the git version control system, HDFS should create a few different 
top level directories keyed off of a few bits in the chunk ID.  Git uses 8 
bits.  This substantially cuts down on the number of chunk files in the same 
directory and gives increased performance.


 Use a better local directory layout for the datanode
 

 Key: HDFS-3290
 URL: https://issues.apache.org/jira/browse/HDFS-3290
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 0.23.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor

 When the HDFS DataNode stores chunks in a local directory, it currently puts 
 all of the chunk files into either one big directory, or a collection of 
 directories.  However, there is no way to know which directory a given block 
 will end up in, given its ID.  As the number of files increases, this does 
 not scale well.
 Similar to the git version control system, HDFS should create a few different 
 top level directories keyed off of a few bits in the chunk ID.  Git uses 8 
 bits.  This substantially cuts down on the number of chunk files in the same 
 directory and gives increased performance, while not compromising O(1) lookup 
 of chunks.
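As a rough illustration of the proposed layout, the sketch below derives a subdirectory from the low 8 bits of a chunk ID, so the directory can be computed from the ID alone. The class name BlockDirLayout, the "subdir" prefix and the "blk_" file prefix are assumptions made for this example; it is not the layout that any patch on this issue implements.

```java
import java.io.File;

/** Hypothetical helper: maps a chunk ID to a directory using its low 8 bits. */
final class BlockDirLayout {
    private static final int BITS = 8;                 // 2^8 = 256 top-level directories
    private static final long MASK = (1L << BITS) - 1; // selects the low 8 bits

    /** Returns the subdirectory name for a chunk ID, for example "subdir05". */
    static String subdirName(long blockId) {
        // Masking first keeps the result in 0..255 even for negative IDs.
        return String.format("subdir%02x", blockId & MASK);
    }

    /** Returns the file a chunk would live in under the given root (assumed naming). */
    static File blockFile(File root, long blockId) {
        return new File(new File(root, subdirName(blockId)), "blk_" + blockId);
    }
}
```

Because the directory is a pure function of the ID, lookup stays O(1): no directory scan is needed to find a given chunk.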





[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups

2012-04-18 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Attachment: HDFS-3206.004.patch

* here's a patch with the binary part included.  Hopefully jenkins won't choke 
on it...

 oev: miscellaneous xml cleanups
 ---

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch, 
 HDFS-3206.003.patch, HDFS-3206.004.patch


 * SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
 * consistently serialize generation stamp as GENSTAMP.
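The SetOwner case in the first bullet can be pictured with the following sketch, which writes only the owner fields that are actually set. The class name, the DATA, USERNAME and GROUPNAME element names, and the use of null to mean "not set" are assumptions for illustration; the real oev XML schema may differ.

```java
/** Hypothetical sketch: serialize a SetOwner operation with optional user and group. */
final class SetOwnerXml {
    /** Escapes the characters that would break element content. */
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** A null user or group means that field was not changed and is left out. */
    static String toXml(String user, String group) {
        StringBuilder sb = new StringBuilder("<DATA>");
        if (user != null) {
            sb.append("<USERNAME>").append(esc(user)).append("</USERNAME>");
        }
        if (group != null) {
            sb.append("<GROUPNAME>").append(esc(group)).append("</GROUPNAME>");
        }
        return sb.append("</DATA>").toString();
    }
}
```

A reader of this format would treat a missing element as "unchanged", which is what makes the group-only case representable.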





[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-04-17 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Attachment: HDFS-3049.003.patch

rebase on trunk

 During the normal loading NN startup process, fall back on a different 
 EditLog if we see one that is corrupt
 

 Key: HDFS-3049
 URL: https://issues.apache.org/jira/browse/HDFS-3049
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: name-node
Affects Versions: 0.23.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
 HDFS-3049.003.patch


 During the NameNode startup process, we load an image, and then apply edit 
 logs to it until we believe that we have all the latest changes.  
 Unfortunately, if there is an I/O error while reading any of these files, in 
 most cases, we simply abort the startup process.  We should try harder to 
 locate a readable edit log and/or image file.
 *There are three main use cases for this feature:*
 1. If the operating system does not honor fsync (usually due to a 
 misconfiguration), a file may end up in an inconsistent state.
 2. In certain older releases where we did not use fallocate() or similar to 
 pre-reserve blocks, a disk full condition may cause a truncated log in one 
 edit directory.
 3. There may be a bug in HDFS which results in some of the data directories 
 receiving corrupt data, but not all.  This is the least likely use case.
 *Proposed changes to normal NN startup*
 * We should try a different FSImage if we can't load the first one we try.
 * We should examine other FSEditLogs if we can't load the first one(s) we try.
 * We should fail if we can't find EditLogs that would bring us up to what we 
 believe is the latest transaction ID.
 *Proposed changes to recovery mode NN startup*
 We should list all the available storage directories and allow the
 operator to select which one to use.
 Something like this:
 {code}
 Multiple storage directories found.
 1. /foo/bar
     edits__current__XYZ  size:213421345   md5:2345345
     image  size:213421345   md5:2345345
 2. /foo/baz
     edits__current__XYZ  size:213421345   md5:2345345345
     image  size:213421345   md5:2345345
 Which one would you like to use? (1/2)
 {code}
 As usual in recovery mode, we want to be flexible about error handling.  In 
 this case, this means that we should NOT fail if we can't find EditLogs that 
 would bring us up to what we believe is the latest transaction ID.
 *Not addressed by this feature*
 This feature will not address the case where an attempt to access the 
 NameNode name directory or directories hangs because of an I/O error.  This 
 may happen, for example, when trying to load an image from a hard-mounted NFS 
 directory, when the NFS server has gone away.  Just as now, the operator will 
 have to notice this problem and take steps to correct it.
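The fallback behaviour described above might look roughly like the sketch below. It is a simplified model under stated assumptions: LogSource and loadFirstReadable are hypothetical names rather than HDFS APIs, and "loading" a copy is reduced to returning the last transaction ID that copy contains.

```java
import java.io.IOException;
import java.util.List;

/** Sketch of edit log failover; all names here are hypothetical, not HDFS APIs. */
final class EditLogFailover {
    /** Stand-in for one copy of the edit log in one storage directory. */
    interface LogSource {
        /** Reads the log and returns the last transaction ID it contains. */
        long load() throws IOException;
    }

    static long loadFirstReadable(List<LogSource> candidates,
                                  long expectedLastTxId,
                                  boolean recoveryMode) throws IOException {
        IOException lastFailure = null;
        long best = -1; // highest txid seen from a readable but short log
        for (LogSource candidate : candidates) {
            try {
                long lastTxId = candidate.load();
                if (lastTxId >= expectedLastTxId) {
                    return lastTxId; // this copy is complete
                }
                best = Math.max(best, lastTxId);
            } catch (IOException e) {
                lastFailure = e; // corrupt or unreadable: try the next copy
            }
        }
        if (recoveryMode && best >= 0) {
            return best; // recovery mode tolerates a short log
        }
        throw new IOException("no edit log reaches txid " + expectedLastTxId, lastFailure);
    }
}
```

In normal startup the method fails when no copy reaches the expected transaction ID; with recoveryMode set it settles for the longest readable copy, matching the more flexible error handling proposed for recovery mode.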





[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups

2012-04-17 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Attachment: HDFS-3206.003.patch

 oev: miscellaneous xml cleanups
 ---

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch, 
 HDFS-3206.003.patch


 * SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
 * consistently serialize generation stamp as GENSTAMP.





[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-04-16 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Attachment: HDFS-3049.002.patch

* fix tests

 During the normal loading NN startup process, fall back on a different 
 EditLog if we see one that is corrupt
 

 Key: HDFS-3049
 URL: https://issues.apache.org/jira/browse/HDFS-3049
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: name-node
Affects Versions: 0.23.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch


 During the NameNode startup process, we load an image, and then apply edit 
 logs to it until we believe that we have all the latest changes.  
 Unfortunately, if there is an I/O error while reading any of these files, in 
 most cases, we simply abort the startup process.  We should try harder to 
 locate a readable edit log and/or image file.
 *There are three main use cases for this feature:*
 1. If the operating system does not honor fsync (usually due to a 
 misconfiguration), a file may end up in an inconsistent state.
 2. In certain older releases where we did not use fallocate() or similar to 
 pre-reserve blocks, a disk full condition may cause a truncated log in one 
 edit directory.
 3. There may be a bug in HDFS which results in some of the data directories 
 receiving corrupt data, but not all.  This is the least likely use case.
 *Proposed changes to normal NN startup*
 * We should try a different FSImage if we can't load the first one we try.
 * We should examine other FSEditLogs if we can't load the first one(s) we try.
 * We should fail if we can't find EditLogs that would bring us up to what we 
 believe is the latest transaction ID.
 *Proposed changes to recovery mode NN startup*
 We should list all the available storage directories and allow the
 operator to select which one to use.
 Something like this:
 {code}
 Multiple storage directories found.
 1. /foo/bar
     edits__current__XYZ  size:213421345   md5:2345345
     image  size:213421345   md5:2345345
 2. /foo/baz
     edits__current__XYZ  size:213421345   md5:2345345345
     image  size:213421345   md5:2345345
 Which one would you like to use? (1/2)
 {code}
 As usual in recovery mode, we want to be flexible about error handling.  In 
 this case, this means that we should NOT fail if we can't find EditLogs that 
 would bring us up to what we believe is the latest transaction ID.
 *Not addressed by this feature*
 This feature will not address the case where an attempt to access the 
 NameNode name directory or directories hangs because of an I/O error.  This 
 may happen, for example, when trying to load an image from a hard-mounted NFS 
 directory, when the NFS server has gone away.  Just as now, the operator will 
 have to notice this problem and take steps to correct it.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Status: Open  (was: Patch Available)

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch, HDFS-3134.004.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.
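The length check described above could be sketched as follows. The class name BoundedReader and the 1 MiB cap are assumptions chosen for the example; the limits and helpers in the actual patches may be different.

```java
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical sketch: read a length-prefixed byte array with an upper bound. */
final class BoundedReader {
    /** Assumed cap on any single array read from the log (1 MiB). */
    static final int MAX_LEN = 1 << 20;

    static byte[] readBytes(DataInputStream in) throws IOException {
        int len = in.readInt();
        // Validate before allocating, so bad input cannot trigger a huge
        // allocation or a NegativeArraySizeException.
        if (len < 0 || len > MAX_LEN) {
            throw new IOException("invalid array length " + len
                + " (limit is " + MAX_LEN + ")");
        }
        byte[] buf = new byte[len];
        in.readFully(buf); // throws EOFException, an IOException, if truncated
        return buf;
    }
}
```

Every failure mode here surfaces as an IOException with a message about the input, rather than as an OutOfMemoryError or another unchecked exception.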





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Attachment: HDFS-3134.005.patch

new patch with the common stuff split out.  Requires HADOOP-8275

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different image or EditLog if we see one that is corrupt

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Attachment: HDFS-3049.001.patch

* implement edit log failover (no image stuff in here)

 During the normal loading NN startup process, fall back on a different image 
 or EditLog if we see one that is corrupt
 -

 Key: HDFS-3049
 URL: https://issues.apache.org/jira/browse/HDFS-3049
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 0.24.0

 Attachments: HDFS-3049.001.patch


 During the NameNode startup process, we load an image, and then apply edit 
 logs to it until we believe that we have all the latest changes.  
 Unfortunately, if there is an I/O error while reading any of these files, in 
 most cases, we simply abort the startup process.  We should try harder to 
 locate a readable edit log and/or image file.
 *There are three main use cases for this feature:*
 1. If the operating system does not honor fsync (usually due to a 
 misconfiguration), a file may end up in an inconsistent state.
 2. In certain older releases where we did not use fallocate() or similar to 
 pre-reserve blocks, a disk full condition may cause a truncated log in one 
 edit directory.
 3. There may be a bug in HDFS which results in some of the data directories 
 receiving corrupt data, but not all.  This is the least likely use case.
 *Proposed changes to normal NN startup*
 * We should try a different FSImage if we can't load the first one we try.
 * We should examine other FSEditLogs if we can't load the first one(s) we try.
 * We should fail if we can't find EditLogs that would bring us up to what we 
 believe is the latest transaction ID.
 *Proposed changes to recovery mode NN startup*
 We should list all the available storage directories and allow the
 operator to select which one to use.
 Something like this:
 {code}
 Multiple storage directories found.
 1. /foo/bar
     edits__current__XYZ  size:213421345   md5:2345345
     image  size:213421345   md5:2345345
 2. /foo/baz
     edits__current__XYZ  size:213421345   md5:2345345345
     image  size:213421345   md5:2345345
 Which one would you like to use? (1/2)
 {code}
 As usual in recovery mode, we want to be flexible about error handling.  In 
 this case, this means that we should NOT fail if we can't find EditLogs that 
 would bring us up to what we believe is the latest transaction ID.
 *Not addressed by this feature*
 This feature will not address the case where an attempt to access the 
 NameNode name directory or directories hangs because of an I/O error.  This 
 may happen, for example, when trying to load an image from a hard-mounted NFS 
 directory, when the NFS server has gone away.  Just as now, the operator will 
 have to notice this problem and take steps to correct it.





[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Summary: During the normal loading NN startup process, fall back on a 
different EditLog if we see one that is corrupt  (was: During the normal 
loading NN startup process, fall back on a different image or EditLog if we see 
one that is corrupt)

remove references to FSImage (there is now a separate JIRA for that)

 During the normal loading NN startup process, fall back on a different 
 EditLog if we see one that is corrupt
 

 Key: HDFS-3049
 URL: https://issues.apache.org/jira/browse/HDFS-3049
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 0.24.0

 Attachments: HDFS-3049.001.patch


 During the NameNode startup process, we load an image, and then apply edit 
 logs to it until we believe that we have all the latest changes.  
 Unfortunately, if there is an I/O error while reading any of these files, in 
 most cases, we simply abort the startup process.  We should try harder to 
 locate a readable edit log and/or image file.
 *There are three main use cases for this feature:*
 1. If the operating system does not honor fsync (usually due to a 
 misconfiguration), a file may end up in an inconsistent state.
 2. In certain older releases where we did not use fallocate() or similar to 
 pre-reserve blocks, a disk full condition may cause a truncated log in one 
 edit directory.
 3. There may be a bug in HDFS which results in some of the data directories 
 receiving corrupt data, but not all.  This is the least likely use case.
 *Proposed changes to normal NN startup*
 * We should try a different FSImage if we can't load the first one we try.
 * We should examine other FSEditLogs if we can't load the first one(s) we try.
 * We should fail if we can't find EditLogs that would bring us up to what we 
 believe is the latest transaction ID.
 *Proposed changes to recovery mode NN startup*
 We should list all the available storage directories and allow the
 operator to select which one to use.
 Something like this:
 {code}
 Multiple storage directories found.
 1. /foo/bar
     edits__current__XYZ  size:213421345   md5:2345345
     image  size:213421345   md5:2345345
 2. /foo/baz
     edits__current__XYZ  size:213421345   md5:2345345345
     image  size:213421345   md5:2345345
 Which one would you like to use? (1/2)
 {code}
 As usual in recovery mode, we want to be flexible about error handling.  In 
 this case, this means that we should NOT fail if we can't find EditLogs that 
 would bring us up to what we believe is the latest transaction ID.
 *Not addressed by this feature*
 This feature will not address the case where an attempt to access the 
 NameNode name directory or directories hangs because of an I/O error.  This 
 may happen, for example, when trying to load an image from a hard-mounted NFS 
 directory, when the NFS server has gone away.  Just as now, the operator will 
 have to notice this problem and take steps to correct it.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Affects Version/s: 0.23.0
Fix Version/s: 2.0.0

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0

 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Status: Patch Available  (was: Open)

 During the normal loading NN startup process, fall back on a different 
 EditLog if we see one that is corrupt
 

 Key: HDFS-3049
 URL: https://issues.apache.org/jira/browse/HDFS-3049
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 0.24.0

 Attachments: HDFS-3049.001.patch


 During the NameNode startup process, we load an image, and then apply edit 
 logs to it until we believe that we have all the latest changes.  
 Unfortunately, if there is an I/O error while reading any of these files, in 
 most cases, we simply abort the startup process.  We should try harder to 
 locate a readable edit log and/or image file.
 *There are three main use cases for this feature:*
 1. If the operating system does not honor fsync (usually due to a 
 misconfiguration), a file may end up in an inconsistent state.
 2. In certain older releases where we did not use fallocate() or similar to 
 pre-reserve blocks, a disk full condition may cause a truncated log in one 
 edit directory.
 3. There may be a bug in HDFS which results in some of the data directories 
 receiving corrupt data, but not all.  This is the least likely use case.
 *Proposed changes to normal NN startup*
 * We should try a different FSImage if we can't load the first one we try.
 * We should examine other FSEditLogs if we can't load the first one(s) we try.
 * We should fail if we can't find EditLogs that would bring us up to what we 
 believe is the latest transaction ID.
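
The fallback behaviour proposed in the list above could be sketched roughly as follows. This is only an illustration in plain Java, not the code from the attached patch: `EditLogFallback`, `LogSource`, and `loadFirstReadable` are hypothetical names, and each candidate log is modelled as something that either returns the last transaction ID it could read or throws an `IOException`.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical illustration only; none of these names come from HDFS.
final class EditLogFallback {
    /** One candidate copy of an edit log (for example, one per storage directory). */
    interface LogSource {
        /** Reads the log and returns the last transaction ID it contains. */
        long load() throws IOException;
    }

    /**
     * Tries each candidate in order and returns the last transaction ID of the
     * first one that loads cleanly and reaches expectedLastTxId.
     * Fails only if no candidate gets that far.
     */
    static long loadFirstReadable(List<LogSource> candidates, long expectedLastTxId)
            throws IOException {
        IOException lastError = null;
        for (LogSource candidate : candidates) {
            try {
                long lastTxId = candidate.load();
                if (lastTxId >= expectedLastTxId) {
                    return lastTxId;
                }
                lastError = new IOException("log ends at txid " + lastTxId
                        + ", expected at least " + expectedLastTxId);
            } catch (IOException e) {
                lastError = e; // corrupt or unreadable: fall back to the next copy
            }
        }
        throw new IOException("no edit log reaches txid " + expectedLastTxId, lastError);
    }
}
```

In this sketch a corrupt copy is simply skipped, and startup fails only when every copy is either unreadable or too short, which matches the third bullet above.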
 *Proposed changes to recovery mode NN startup*
 We should list all the available storage directories and allow the 
 operator to select which one to use.
 Something like this:
 {code}
 Multiple storage directories found.
 1. /foo/bar
     edits__current__XYZ  size:213421345   md5:2345345
     image                size:213421345   md5:2345345
 2. /foo/baz
     edits__current__XYZ  size:213421345   md5:2345345345
     image                size:213421345   md5:2345345
 Which one would you like to use? (1/2)
 {code}
 As usual in recovery mode, we want to be flexible about error handling.  In 
 this case, this means that we should NOT fail if we can't find EditLogs that 
 would bring us up to what we believe is the latest transaction ID.
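
A prompt like the one mocked up above might be implemented along these lines. This is a minimal sketch under stated assumptions: `StorageDirChooser` and `choose` are hypothetical names, the directories are plain strings, and the size and md5 columns from the mock-up are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.List;

// Hypothetical sketch; the class and method names are illustrative, not from the patch.
final class StorageDirChooser {
    /**
     * Lists the directories, then reads lines until the operator enters a
     * valid 1-based index. Returns the chosen directory.
     */
    static String choose(List<String> dirs, BufferedReader in, PrintStream out)
            throws IOException {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("no storage directories to choose from");
        }
        out.println("Multiple storage directories found.");
        for (int i = 0; i < dirs.size(); i++) {
            out.println((i + 1) + ". " + dirs.get(i));
        }
        while (true) {
            out.println("Which one would you like to use? (1-" + dirs.size() + ")");
            String line = in.readLine();
            if (line == null) {
                throw new IOException("input closed before a directory was chosen");
            }
            try {
                int choice = Integer.parseInt(line.trim());
                if (choice >= 1 && choice <= dirs.size()) {
                    return dirs.get(choice - 1);
                }
            } catch (NumberFormatException e) {
                // Not a number: fall through and ask again.
            }
            out.println("Invalid choice: " + line);
        }
    }
}
```

Passing the reader and the output stream in as parameters keeps the prompt testable without a terminal.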
 *Not addressed by this feature*
 This feature will not address the case where an attempt to access the 
 NameNode name directory or directories hangs because of an I/O error.  This 
 may happen, for example, when trying to load an image from a hard-mounted NFS 
 directory, when the NFS server has gone away.  Just as now, the operator will 
 have to notice this problem and take steps to correct it.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-13 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Target Version/s: 2.0.0
   Fix Version/s: (was: 2.0.0)

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch, HDFS-3134.004.patch, HDFS-3134.005.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.
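
As an illustration of the upper-limit idea described above, a length-prefixed read might validate the length before allocating. This is a sketch only, not the code from the attached patches: `BoundedReads`, `readBytes`, and the 4-byte length prefix are assumptions made for the example.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch; not the code from the attached patches.
final class BoundedReads {
    /**
     * Reads a 4-byte length followed by that many bytes, rejecting lengths
     * that are negative or larger than maxLength before allocating anything.
     */
    static byte[] readBytes(DataInputStream in, int maxLength) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > maxLength) {
            throw new IOException("invalid array length " + length
                    + " (allowed: 0.." + maxLength + ")");
        }
        byte[] data = new byte[length];
        in.readFully(data); // throws EOFException (an IOException) on truncated input
        return data;
    }
}
```

Because the check runs before `new byte[length]`, malformed input produces an `IOException` rather than a `NegativeArraySizeException` or an out of memory error.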





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-12 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Release Note: This is a new feature.  It is documented in 
hdfs_user_guide.xml.

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.1.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, 
 HDFS-3055-b1.006.patch, HDFS-3055-b1.007.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-12 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Release Note: This is a new feature.  It is documented in 
hdfs_user_guide.xml.

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0

 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, 
 HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.043.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and which can 
 result in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
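
The interactive behaviour described above (answer each prompt, or take the first option for everything via a force flag or by typing 'a') could look roughly like this. It is a sketch under assumptions, not the patch's implementation: `RecoveryPrompt` and `ask` are hypothetical names, and options are matched by prefix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of an "answer once, or always take the first option" prompt.
final class RecoveryPrompt {
    private final BufferedReader in;
    private boolean alwaysFirst; // set by a force flag or by the operator typing 'a'

    RecoveryPrompt(BufferedReader in, boolean force) {
        this.in = in;
        this.alwaysFirst = force;
    }

    /**
     * Returns the chosen option. Options must be non-empty and, in this
     * sketch, must not start with 'a', which is reserved for "always".
     */
    String ask(String question, List<String> options) throws IOException {
        if (alwaysFirst) {
            return options.get(0);
        }
        while (true) {
            System.err.println(question + " " + options
                    + " or (a)lways take the first option");
            String line = in.readLine();
            if (line == null) {
                throw new IOException("input closed while waiting for an answer");
            }
            String answer = line.trim().toLowerCase();
            if (answer.equals("a")) {
                alwaysFirst = true; // remember the choice for later prompts
                return options.get(0);
            }
            for (String option : options) {
                if (!answer.isEmpty() && option.toLowerCase().startsWith(answer)) {
                    return option;
                }
            }
            // Unrecognized answer: ask again.
        }
    }
}
```

Once 'a' has been typed, later calls to `ask` return the first option without reading any input, the same as when the prompt is constructed with `force` set.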
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-12 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Attachment: HDFS-3134.004.patch

* add unit test

* make range check more flexible (adding upper as well as lower bound, make 
lower bound configurable)

* fix bug where we might not decode certain DelegationKey objects because we 
encoded them with length = -1 (i.e. no key)
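
To illustrate why a configurable lower bound matters here, the sketch below accepts -1 as a "no key" marker while still rejecting other negative or oversized lengths. The names `RangeCheckedReads`, `readIntInRange`, and `readOptionalKey`, and the 4-byte length encoding, are assumptions for the example and are not taken from HDFS-3134.004.patch.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch; names and layout are illustrative only.
final class RangeCheckedReads {
    /** Reads an int and throws IOException unless min <= value <= max. */
    static int readIntInRange(DataInputStream in, int min, int max) throws IOException {
        int value = in.readInt();
        if (value < min || value > max) {
            throw new IOException("value " + value
                    + " out of range [" + min + ", " + max + "]");
        }
        return value;
    }

    /** Reads an optional byte array where a length of -1 means "no key" (returns null). */
    static byte[] readOptionalKey(DataInputStream in, int maxLength) throws IOException {
        int length = readIntInRange(in, -1, maxLength); // lower bound of -1 admits the marker
        if (length == -1) {
            return null;
        }
        byte[] key = new byte[length];
        in.readFully(key);
        return key;
    }
}
```

With a fixed lower bound of 0, the -1 marker would be rejected as malformed input, which is the kind of decoding failure the third bullet describes.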

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch, HDFS-3134.004.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-11 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.007.patch

In the docs, refer to -force, not --force

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, 
 HDFS-3055-b1.006.patch, HDFS-3055-b1.007.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3169) TestFsck should test multiple -move operations in a row

2012-04-11 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3169:
---

Status: Patch Available  (was: Open)

 TestFsck should test multiple -move operations in a row
 ---

 Key: HDFS-3169
 URL: https://issues.apache.org/jira/browse/HDFS-3169
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3169.001.patch


 TestFsck should test multiple -move operations in a row.  Overall, it would 
 be nice to have more coverage on this.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-11 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Attachment: HDFS-3134.003.patch

rebase on trunk

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch, 
 HDFS-3134.003.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3248) bootstrapstanby repeated twice in hdfs namenode usage message

2012-04-10 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3248:
---

Status: Patch Available  (was: Open)

 bootstrapstanby repeated twice in hdfs namenode usage message
 -

 Key: HDFS-3248
 URL: https://issues.apache.org/jira/browse/HDFS-3248
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3248.002.patch


 The HDFS usage message repeats bootstrapStandby twice.
 {code}
 Usage: java NameNode [-backup] | [-checkpoint] | [-format[-clusterid cid ]] | 
 [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint] | 
 [-bootstrapStandby] | [-initializeSharedEdits] | [-bootstrapStandby] | 
 [-recover [ -force ] ]
 {code}





[jira] [Updated] (HDFS-3248) bootstrapstanby repeated twice in hdfs namenode usage message

2012-04-10 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3248:
---

Attachment: HDFS-3248.002.patch

 bootstrapstanby repeated twice in hdfs namenode usage message
 -

 Key: HDFS-3248
 URL: https://issues.apache.org/jira/browse/HDFS-3248
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3248.002.patch


 The HDFS usage message repeats bootstrapStandby twice.
 {code}
 Usage: java NameNode [-backup] | [-checkpoint] | [-format[-clusterid cid ]] | 
 [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint] | 
 [-bootstrapStandby] | [-initializeSharedEdits] | [-bootstrapStandby] | 
 [-recover [ -force ] ]
 {code}





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-09 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.005.patch

* move askOperator to MetaRecoveryContext::editLogLoaderPrompt

* remove unnecessary toString() call

* warn about losing data from your HDFS filesystem rather than losing data 
from your filesystem

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-09 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.006.patch

* TestNameNodeRecovery: use StringUtils instead of StringWriter to serialize 
exception

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, 
 HDFS-3055-b1.006.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: (was: HDFS-3004__namenode_recovery_tool.txt)

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and which can 
 result in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004__namenode_recovery_tool.txt

* update design document

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and which can 
 result in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.042.patch

* always prompt about possible data loss, even when -force is specified

* update hdfs_user_guide.xml so that it talks about -force

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and which can 
 result in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Attachment: HDFS-3134.001.patch

* In the edit log loader, don't try to allocate arrays of negative length.  
Instead, throw an IOException.

* When deserializing a variable-length number into a Java integer, do not 
ignore problems resulting from truncation; throw an IOException instead.
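
The two checks listed above might be sketched as follows, assuming the variable-length number has already been decoded into a `long`. `CheckedNarrowing`, `toInt`, and `toArrayLength` are hypothetical names chosen for the example, not identifiers from the patch.

```java
import java.io.IOException;

// Hypothetical sketch; not the code from HDFS-3134.001.patch.
final class CheckedNarrowing {
    /** Converts a decoded long to an int, throwing IOException instead of silently truncating. */
    static int toInt(long value) throws IOException {
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new IOException("value " + value + " does not fit in a 32-bit int");
        }
        return (int) value;
    }

    /** Validates a length read from input before it is used to size an array. */
    static int toArrayLength(long value) throws IOException {
        int length = toInt(value);
        if (length < 0) {
            throw new IOException("negative array length " + length);
        }
        return length;
    }
}
```

A plain `(int)` cast would silently turn `1L << 32` into 0; the explicit range check reports it as bad input instead.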

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Status: Patch Available  (was: Open)

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.004.patch

* update patch to reflect comments from HDFS-3004

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3134) harden edit log loader against malformed or malicious input

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3134:
---

Attachment: HDFS-3134.002.patch

 harden edit log loader against malformed or malicious input
 ---

 Key: HDFS-3134
 URL: https://issues.apache.org/jira/browse/HDFS-3134
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3134.001.patch, HDFS-3134.002.patch


 Currently, the edit log loader does not handle bad or malicious input 
 sensibly.
 We can often cause OutOfMemory exceptions, null pointer exceptions, or other 
 unchecked exceptions to be thrown by feeding the edit log loader bad input.  
 In some environments, an out of memory error can cause the JVM process to be 
 terminated.
 It's clear that we want these exceptions to be thrown as IOException instead 
 of as unchecked exceptions.  We also want to avoid out of memory situations.
 The main task here is to put a sensible upper limit on the lengths of arrays 
 and strings we allocate on command.  The other task is to try to avoid 
 creating unchecked exceptions (by dereferencing potentially-NULL pointers, 
 for example).  Instead, we should verify ahead of time and give a more 
 sensible error message that reflects the problem with the input.
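
 The length check described above could look roughly like the following.
 This is only a minimal sketch and is not taken from the attached patches:
 the class name BoundedReader, the method readBytes, and the 1 MB limit are
 all assumptions made for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper; not part of the HDFS-3134 patch.
final class BoundedReader {
    // Assumed upper limit, chosen only for this example.
    static final int MAX_ARRAY_LENGTH = 1 << 20;

    // Reads a length-prefixed byte array, validating the length before
    // allocating so that bad input cannot trigger a huge allocation or a
    // NegativeArraySizeException.
    static byte[] readBytes(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > MAX_ARRAY_LENGTH) {
            throw new IOException("invalid array length " + length
                + " (limit is " + MAX_ARRAY_LENGTH + ")");
        }
        byte[] buf = new byte[length];
        // readFully throws EOFException, which is an IOException, if the
        // input ends before the declared length has been read.
        in.readFully(buf);
        return buf;
    }
}
```

 Checking the length before the allocation is what turns a potential
 OutOfMemoryError into an ordinary IOException that names the bad value.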





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-06 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.042.patch

* reposting so jenkins will test

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, 
 HDFS-3004.042.patch, HDFS-3004.042.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and can result 
 in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.
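
 The interactive behavior described above (prompt the operator, or take the
 first option everywhere once forced or once 'a' is typed) might be sketched
 as follows.  This is an illustration only: RecoveryPrompt and ask are
 hypothetical names, not the classes in the patch, and the prefix matching
 of answers is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical sketch; names are not taken from the HDFS-3004 patch.
final class RecoveryPrompt {
    // Once true, every later prompt silently takes the first option.
    private boolean alwaysFirst;

    RecoveryPrompt(boolean force) {
        this.alwaysFirst = force;
    }

    // Assumes options is non-empty and that no option is named "a",
    // because "a" is reserved for "always take the first option".
    String ask(String question, String[] options, BufferedReader in)
            throws IOException {
        while (!alwaysFirst) {
            System.out.println(question + " " + String.join("/", options)
                + " (or 'a' to always take the first option)");
            String line = in.readLine();
            if (line == null) {
                throw new IOException("no more operator input");
            }
            String answer = line.trim();
            if (answer.equals("a")) {
                alwaysFirst = true;
                break;
            }
            for (String option : options) {
                // Accept the full option name or any non-empty prefix.
                if (!answer.isEmpty() && option.startsWith(answer)) {
                    return option;
                }
            }
            // Unrecognized answer: loop and ask again.
        }
        return options[0];
    }
}
```

 Keeping the "always" decision in one object means every call site gets
 the same behavior whether it was requested by a flag or at a prompt.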





[jira] [Updated] (HDFS-3206) oev: correctly serialize SetOwner operations in which the user is not changed

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Attachment: HDFS-3206.001.patch

* fix

 oev: correctly serialize SetOwner operations in which the user is not changed
 -

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch


 SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
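
 To illustrate the case above, here is a minimal sketch of serializing a
 SetOwner operation in which either field may be absent.  The class name
 SetOwnerXml and the element names DATA, SRC, USERNAME and GROUPNAME are
 assumptions for this example, not necessarily what the patch emits, and
 XML escaping is left out for brevity.

```java
// Hypothetical sketch; not the serialization code from the patch.
final class SetOwnerXml {
    // Writes only the fields that are actually set.  A null user or group
    // means "unchanged" and is simply omitted from the output.
    static String toXml(String src, String user, String group) {
        StringBuilder sb = new StringBuilder("<DATA>");
        sb.append("<SRC>").append(src).append("</SRC>");
        if (user != null) {
            sb.append("<USERNAME>").append(user).append("</USERNAME>");
        }
        if (group != null) {
            sb.append("<GROUPNAME>").append(group).append("</GROUPNAME>");
        }
        return sb.append("</DATA>").toString();
    }

    // Returns the text of the given element, or null if it is absent, so
    // that a missing user round-trips as "not set" rather than failing.
    static String readOptional(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = xml.indexOf(open);
        if (start < 0) {
            return null;
        }
        int end = xml.indexOf(close, start + open.length());
        if (end < 0) {
            return null;
        }
        return xml.substring(start + open.length(), end);
    }
}
```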





[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Description: 
* SetOwner operations can change both the user and group which a file or 
directory belongs to, or just one of those.  Currently, in the XML 
serialization/deserialization code, we don't handle the case where just the 
group is set, not the user.  We should handle this case.

* consistently serialize generation stamp as GENSTAMP.

  was:SetOwner operations can change both the user and group which a file or 
directory belongs to, or just one of those.  Currently, in the XML 
serialization/deserialization code, we don't handle the case where just the 
group is set, not the user.  We should handle this case.

Summary: oev: miscellaneous xml cleanups  (was: oev: correctly 
serialize SetOwner operations in which the user is not changed)

 oev: miscellaneous xml cleanups
 ---

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch


 * SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
 * consistently serialize generation stamp as GENSTAMP.





[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Status: Patch Available  (was: Open)

 oev: miscellaneous xml cleanups
 ---

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch


 * SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
 * consistently serialize generation stamp as GENSTAMP.





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.003.patch

* rebase on branch-1

todd: yes, this is up to date, and I've run the following tests:
TestCheckpoint,
TestEditLog,
TestNameNodeRecovery,
TestEditLogLoading,
TestNameNodeMXBean,
TestSaveNamespace,
TestSecurityTokenEditLog,
TestStorageDirectoryFailure,
TestStorageRestore

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
 HDFS-3055-b1.003.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.039.patch

* rebase on trunk

* rename RecoveryContext to MetaRecoveryContext

* rename -autoChooseDefault to -noPrompt

* EditLogInputException.java: remove pointless whitespace change

* some whitespace and punctuation improvements to the recovery prompt text.

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and can result 
 in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3206) oev: miscellaneous xml cleanups

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3206:
---

Attachment: HDFS-3206.002.patch

* put OP_END_LOG_SEGMENT at the end of the edit log, where it would be in a 
real edit log.

 oev: miscellaneous xml cleanups
 ---

 Key: HDFS-3206
 URL: https://issues.apache.org/jira/browse/HDFS-3206
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3206.001.patch, HDFS-3206.002.patch


 * SetOwner operations can change both the user and group which a file or 
 directory belongs to, or just one of those.  Currently, in the XML 
 serialization/deserialization code, we don't handle the case where just the 
 group is set, not the user.  We should handle this case.
 * consistently serialize generation stamp as GENSTAMP.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.040.patch

* update findbugsExcludeFile.xml to reflect the fact that RecoveryContext is 
now called MetaRecoveryContext

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and can result 
 in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-05 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.041.patch

* implement -force

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and can result 
 in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-04-04 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.019.patch

* fix error handling bug that could lead to open files getting leaked 
(theoretically)

* suppress javadoc warnings resulting from com.sun.* API use

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, 
 HDFS-3050.018.patch, HDFS-3050.019.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edit log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.
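
 As a rough illustration of what closing gaps in the transaction log means,
 the sketch below renumbers a list of transaction IDs so that they are
 consecutive.  TxidFixer and closeGaps are hypothetical names, and keeping
 the first ID unchanged is an assumption; the actual --fix-txids
 implementation may differ.

```java
// Hypothetical sketch; not the OEV implementation of --fix-txids.
final class TxidFixer {
    // Returns a copy in which each ID is exactly one more than the
    // previous one.  The first ID is kept as the starting point.
    static long[] closeGaps(long[] txids) {
        long[] fixed = new long[txids.length];
        for (int i = 0; i < txids.length; i++) {
            fixed[i] = (i == 0) ? txids[0] : fixed[i - 1] + 1;
        }
        return fixed;
    }
}
```

 For example, if the transaction with ID 7 is removed from the XML, the
 remaining IDs 5, 6, 8, 9 would be renumbered to 5, 6, 7, 8.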





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-04-04 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.020.patch

* update OK_JAVADOC_WARNINGS

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, 
 HDFS-3050.018.patch, HDFS-3050.019.patch, HDFS-3050.020.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edit log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-04 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.038.patch

* rebase

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can happen from 
 time to time, because of hardware errors or misconfigurations.  In the past 
 we have had to correct it manually, which is time-consuming and can result 
 in downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-04-03 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.002.patch

* add unit test

* some fixes to NN unclean shutdown (to allow unit test to work)

* better error reporting for the branch-1 edit log stuff (print out the offset 
when we encounter a problem)

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors

2012-04-03 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-1378:
---

Attachment: HDFS-1378-b1.002.patch

* port to branch-1

 Edit log replay should track and report file offsets in case of errors
 --

 Key: HDFS-1378
 URL: https://issues.apache.org/jira/browse/HDFS-1378
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe
 Fix For: 0.23.0

 Attachments: HDFS-1378-b1.002.patch, hdfs-1378-branch20.txt, 
 hdfs-1378.0.patch, hdfs-1378.1.patch, hdfs-1378.2.txt


 Occasionally there are bugs or operational mistakes that result in corrupt 
 edit logs which I end up having to repair by hand. In these cases it would be 
 very handy to have the error message also print out the file offsets of the 
 last several edit log opcodes so it's easier to find the right place to edit 
 in the OP_INVALID marker. We could also use this facility to provide a rough 
 estimate of how far along edit log replay the NN is during startup (handy 
 when a 2NN has died and replay takes a while)





[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors

2012-04-03 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-1378:
---

Attachment: HDFS-1378-b1.003.patch

* include bug fix from revised patch

* backport unit test as well

 Edit log replay should track and report file offsets in case of errors
 --

 Key: HDFS-1378
 URL: https://issues.apache.org/jira/browse/HDFS-1378
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe
 Fix For: 0.23.0

 Attachments: HDFS-1378-b1.002.patch, HDFS-1378-b1.003.patch, 
 hdfs-1378-branch20.txt, hdfs-1378.0.patch, hdfs-1378.1.patch, hdfs-1378.2.txt


 Occasionally there are bugs or operational mistakes that result in corrupt 
 edit logs which I end up having to repair by hand. In these cases it would be 
 very handy to have the error message also print out the file offsets of the 
 last several edit log opcodes so it's easier to find the right place to edit 
 in the OP_INVALID marker. We could also use this facility to provide a rough 
 estimate of how far along edit log replay the NN is during startup (handy 
 when a 2NN has died and replay takes a while)





[jira] [Updated] (HDFS-1378) Edit log replay should track and report file offsets in case of errors

2012-04-03 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-1378:
---

Attachment: HDFS-1378-b1.004.patch

* add unit test

 Edit log replay should track and report file offsets in case of errors
 --

 Key: HDFS-1378
 URL: https://issues.apache.org/jira/browse/HDFS-1378
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe
 Fix For: 0.23.0

 Attachments: HDFS-1378-b1.002.patch, HDFS-1378-b1.003.patch, 
 HDFS-1378-b1.004.patch, hdfs-1378-branch20.txt, hdfs-1378.0.patch, 
 hdfs-1378.1.patch, hdfs-1378.2.txt


 Occasionally there are bugs or operational mistakes that result in corrupt 
 edit logs which I end up having to repair by hand. In these cases it would be 
 very handy to have the error message also print out the file offsets of the 
 last several edit log opcodes so it's easier to find the right place to edit 
 in the OP_INVALID marker. We could also use this facility to provide a rough 
 estimate of how far along edit log replay the NN is during startup (handy 
 when a 2NN has died and replay takes a while)





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-04-03 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.018.patch

* rebase on latest trunk

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch, 
 HDFS-3050.018.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edits log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.
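
The gap-closing behaviour described for --fix-txids can be sketched as a simple renumbering pass. The sketch assumes the first transaction keeps its ID and every later one is renumbered consecutively; the description does not say how the real tool picks its starting point, and TxidFixer and closeGaps are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of closing gaps in a sequence of transaction IDs. */
final class TxidFixer {
    /**
     * Returns replacement txids, one per input transaction, that start at the
     * first input txid and increase by exactly one. Assumption: the first ID
     * is kept as-is.
     */
    static List<Long> closeGaps(List<Long> txids) {
        List<Long> fixed = new ArrayList<>(txids.size());
        if (txids.isEmpty()) {
            return fixed;
        }
        long next = txids.get(0);
        for (int i = 0; i < txids.size(); i++) {
            fixed.add(next++);   // ignore the old ID, assign the next one
        }
        return fixed;
    }
}
```

For example, if the transaction with ID 7 was deleted from the XML, the remaining IDs 5, 6, 8, 9 would be rewritten as 5, 6, 7, 8.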





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-04-02 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.017.patch

* rebase on trunk

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch, HDFS-3050.016.patch, HDFS-3050.017.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edits log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-04-02 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.037.patch

* style fixes

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a perfect 
 world, we never would.  However, bad data on disk can happen from time to 
 time, because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and which can result in 
 downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.
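
The interactive behaviour described above (prompt on each inconsistency, with '-f' or an 'a' answer selecting the first option from then on) could look roughly like the following. This is a sketch, not the code in the attached patches: RecoveryPrompt, ask and the option strings are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical prompt helper for an interactive recovery session. */
final class RecoveryPrompt {
    private final BufferedReader in;
    private boolean alwaysFirst;   // true when started with -f, or after 'a'

    RecoveryPrompt(boolean force, BufferedReader in) {
        this.alwaysFirst = force;
        this.in = in;
    }

    /** Returns the option chosen by the operator, or the first option in "always" mode. */
    String ask(String question, String... options) {
        if (options.length == 0) {
            throw new IllegalArgumentException("at least one option is required");
        }
        while (!alwaysFirst) {
            System.out.println(question + " [" + String.join("/", options)
                + ", a = always take the first option]");
            String line;
            try {
                line = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) {
                throw new IllegalStateException("input ended before the operator answered");
            }
            String answer = line.trim();
            if (answer.equals("a")) {
                alwaysFirst = true;   // applies to this and all later prompts
                continue;
            }
            for (String option : options) {
                if (option.equals(answer)) {
                    return option;
                }
            }
            // Unrecognized answer: fall through and ask again.
        }
        return options[0];
    }
}
```

Putting the safest action first in each option list matters in this design, because both '-f' and 'a' select it without further confirmation.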





[jira] [Updated] (HDFS-3181) org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart fails intermittently

2012-04-02 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3181:
---

Attachment: testOut.txt

exception, standard output, etc of a failure

 org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart
  fails intermittently
 

 Key: HDFS-3181
 URL: https://issues.apache.org/jira/browse/HDFS-3181
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
 Attachments: testOut.txt


 org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart
  seems to be failing intermittently on jenkins.
 {code}
 org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart
 Failing for the past 1 build (Since Failed#2163 )
 Took 8.4 sec.
 Error Message
 Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655)
 Stacktrace
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch on /hardLeaseRecovery owned by HDFS_NameNode but is accessed by DFSClient_NONMAPREDUCE_1147689755_1
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2076)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2051)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:1983)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:492)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:311)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42604)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:417)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:891)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1661)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1657)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1655)
   at org.apache.hadoop.ipc.Client.call(Client.java:1159)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:185)
   at $Proxy15.getAdditionalDatanode(Unknown Source)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
   at $Proxy15.getAdditionalDatanode(Unknown Source)
   at 
 

[jira] [Updated] (HDFS-3181) testHardLeaseRecoveryAfterNameNodeRestart fails intermittently

2012-04-02 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3181:
---

Summary: testHardLeaseRecoveryAfterNameNodeRestart fails intermittently  
(was: 
org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart
 fails intermittently)

 testHardLeaseRecoveryAfterNameNodeRestart fails intermittently
 --

 Key: HDFS-3181
 URL: https://issues.apache.org/jira/browse/HDFS-3181
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
 Attachments: testOut.txt


 org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryAfterNameNodeRestart
  seems to be failing intermittently on jenkins.

[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-30 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: HDFS-3044-b1.004.patch

* remember to run fsck -delete before checking to see if the file is really 
deleted (d'oh!)

* add test that running fsck -move a few times in a row has no harmful effects

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Fix For: 1.1.0, 2.0.0

 Attachments: HDFS-3044-b1.002.patch, HDFS-3044-b1.004.patch, 
 HDFS-3044.002.patch, HDFS-3044.003.patch


 The fsck move behavior in the code and originally articulated in HADOOP-101 
 is:
 {quote}Current failure modes for DFS involve blocks that are completely 
 missing. The only way to fix them would be to recover chains of blocks and 
 put them into lost+found{quote}
 A directory is created with the file name, the blocks that are accessible are 
 created as individual files in this directory, then the original file is 
 removed. 
 I suspect the rationale for this behavior was that you can't use files that 
 are missing locations, and copying the blocks as files at least makes part of 
 the files accessible. However, this behavior can also result in permanent 
 data loss. E.g.:
 - Some datanodes don't come up (e.g. due to HW issues) and check in on cluster 
 startup; files with blocks where all replicas are on this set of datanodes 
 are marked corrupt
 - Admin does fsck move, which deletes the corrupt files and saves whatever 
 blocks were available
 - The HW issues with the datanodes are resolved; they are started and join the 
 cluster. The NN tells them to delete their blocks for the corrupt files since 
 the files were deleted. 
 I think we should:
 - Make fsck move non-destructive by default (e.g. just does a move into 
 lost+found)
 - Make the destructive behavior optional (e.g. --destructive so admins think 
 about what they're doing)
 - Provide better sanity checks and warnings, e.g. if you're running fsck and 
 not all the slaves have checked in (if using dfs.hosts) then fsck should 
 print a warning indicating this, which an admin should have to override if they 
 want to do something destructive
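
The last proposal above (warn when not every expected datanode has checked in, and require an explicit override before doing anything destructive) can be sketched as a small check. FsckSanityCheck, missingHosts and mayProceedDestructively are hypothetical names; the sketch assumes the expected host list has already been read from the dfs.hosts include file and is not taken from the attached patches.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical pre-flight check before a destructive fsck operation. */
final class FsckSanityCheck {
    /** Hosts that are expected (e.g. listed via dfs.hosts) but have not checked in. */
    static Set<String> missingHosts(Set<String> expected, Set<String> checkedIn) {
        Set<String> missing = new TreeSet<>(expected);   // sorted for stable output
        missing.removeAll(checkedIn);
        return missing;
    }

    /**
     * Allows a destructive operation only if every expected host has checked
     * in, or the admin explicitly overrode the warning.
     */
    static boolean mayProceedDestructively(Set<String> expected,
                                           Set<String> checkedIn,
                                           boolean override) {
        Set<String> missing = missingHosts(expected, checkedIn);
        if (!missing.isEmpty()) {
            System.err.println("WARNING: datanodes not checked in: " + missing);
        }
        return missing.isEmpty() || override;
    }
}
```

With this shape, the data-loss scenario listed above is caught at the second step: the admin sees which datanodes are still absent before any file is deleted.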





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-03-30 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Description: 
Currently, OEV (the offline edits viewer) re-implements all of the opcode parsing 
logic found in the NameNode.  This duplicated code creates a maintenance burden 
for us.

OEV should be refactored to simply use the normal EditLog parsing code, rather 
than rolling its own.  By using the existing FSEditLogLoader code to load edits 
in OEV, we can avoid having to update two places when the format changes.

We should not put opcode checksums into the XML, because they are a 
serialization detail, unrelated to the data we're storing.  This 
will also make it possible to modify the XML file and translate this modified 
file back to a binary edits log file.

Finally, this change introduces --fix-txids.  When OEV is passed this flag, it 
will close gaps in the transaction log by modifying the sequence numbers.  This 
is useful if you want to modify the edit log XML (say, by removing a 
transaction), and transform the modified XML back into a valid binary edit log 
file.

  was:
Currently, OEV (the offline edits viewer) re-implements all of the opcode parsing 
logic found in the NameNode.  This duplicated code creates a maintenance burden 
for us.

OEV should be refactored to simply use the normal EditLog parsing code, rather 
than rolling its own.

Summary: rework OEV to share more code with the NameNode  (was: 
refactor OEV to share more code with the NameNode)

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edits log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.





[jira] [Updated] (HDFS-3050) rework OEV to share more code with the NameNode

2012-03-30 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.016.patch

* rebase against current trunk

* posted improved patch description and name

 rework OEV to share more code with the NameNode
 ---

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch, HDFS-3050.016.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.  By using the existing FSEditLogLoader code to 
 load edits in OEV, we can avoid having to update two places when the format 
 changes.
 We should not put opcode checksums into the XML, because they are a 
 serialization detail, unrelated to the data we're storing.  
 This will also make it possible to modify the XML file and translate this 
 modified file back to a binary edits log file.
 Finally, this change introduces --fix-txids.  When OEV is passed this flag, 
 it will close gaps in the transaction log by modifying the sequence numbers.  
 This is useful if you want to modify the edit log XML (say, by removing a 
 transaction), and transform the modified XML back into a valid binary edit 
 log file.





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-30 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.036.patch

* rebase on trunk

* slight cleanup of EditLogBackupInputStream::nextValidOp()

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a perfect 
 world, we never would.  However, bad data on disk can happen from time to 
 time, because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and which can result in 
 downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.





[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-29 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.015.patch

* set 2-space indentation in output XML to match old XML

* add Javadoc comments to XMLUtils

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch, 
 HDFS-3050.015.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.





[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-29 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: HDFS-3050-b1.001.patch

* port to branch-1

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0

 Attachments: HDFS-3044.002.patch, HDFS-3044.003.patch, 
 HDFS-3050-b1.001.patch


 The fsck move behavior in the code and originally articulated in HADOOP-101 
 is:
 {quote}Current failure modes for DFS involve blocks that are completely 
 missing. The only way to fix them would be to recover chains of blocks and 
 put them into lost+found{quote}
 A directory is created with the file name, the blocks that are accessible are 
 created as individual files in this directory, then the original file is 
 removed. 
 I suspect the rationale for this behavior was that you can't use files that 
 are missing locations, and copying the blocks as files at least makes part of 
 the files accessible. However, this behavior can also result in permanent 
 data loss. For example:
 - Some datanodes don't come up (e.g. due to HW issues) and check in on cluster 
 startup, so files with blocks whose replicas are all on this set of datanodes 
 are marked corrupt
 - Admin does fsck move, which deletes the corrupt files and saves whatever 
 blocks were available
 - The HW issues with the datanodes are resolved; they are started and join the 
 cluster. The NN tells them to delete their blocks for the corrupt files since 
 the files were deleted. 
 I think we should:
 - Make fsck move non-destructive by default (e.g. just do a move into 
 lost+found)
 - Make the destructive behavior optional (e.g. --destructive so admins think 
 about what they're doing)
 - Provide better sanity checks and warnings, e.g. if you're running fsck and 
 not all the slaves have checked in (if using dfs.hosts) then fsck should 
 print a warning indicating this, which an admin should have to override if they 
 want to do something destructive
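
 To make the difference between the two behaviors concrete, here is a toy 
 model in Python. It is only a sketch: the "namespace" is a plain dict from 
 paths to block IDs, and fsck_move, referenced_blocks and the destructive 
 parameter are hypothetical names for illustration, not anything in the 
 Hadoop code base.

```python
# Toy model only: a "namespace" maps file paths to lists of block IDs.
# All names here are hypothetical and do not come from Hadoop.

def fsck_move(namespace, reachable_blocks, path, destructive=False):
    """Move a corrupt file out of the way, optionally destructively."""
    blocks = namespace.pop(path)
    target = "/lost+found" + path
    if not destructive:
        # Plain rename: the file keeps its full block list, so replicas that
        # reappear later are still referenced by the namespace.
        namespace[target] = blocks
        return
    # Destructive variant: each reachable block becomes its own file under a
    # directory named after the original file; the original entry is gone.
    for index, block in enumerate(blocks):
        if block in reachable_blocks:
            namespace[f"{target}/{index}"] = [block]


def referenced_blocks(namespace):
    """Blocks still referenced by some file; others are eligible for deletion."""
    return {b for blocks in namespace.values() for b in blocks}
```

 In this model, a destructive move of a three-block file with one unreachable 
 block leaves that block unreferenced, which corresponds to the case where 
 the NN later tells a recovered datanode to delete it. The plain rename keeps 
 all three blocks referenced.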





[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-29 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: HDFS-3044-b1.002.patch

* fix patch name

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0

 Attachments: HDFS-3044-b1.002.patch, HDFS-3044.002.patch, 
 HDFS-3044.003.patch







[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-29 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: (was: HDFS-3050-b1.001.patch)

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0

 Attachments: HDFS-3044-b1.002.patch, HDFS-3044.002.patch, 
 HDFS-3044.003.patch







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-29 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.035.patch

* TestNameNodeRecovery: don't need data nodes for this test

* TestNameNodeRecovery: use set rather than array


 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a perfect 
 world, we never would.  However, bad data on disk can happen from time to 
 time, because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and which can result in 
 downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.
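
 The prompting behavior described above can be sketched roughly as follows. 
 This is a minimal Python illustration, not the NameNode's actual code: the 
 RecoveryPrompter class, its force parameter (standing in for the '-f' flag) 
 and the option names are all assumptions made for the example. The first 
 entry in options is treated as the default choice.

```python
class RecoveryPrompter:
    """Hypothetical helper modelling the interactive recovery prompts."""

    def __init__(self, force=False, read_line=input):
        self.always_first = force   # True models starting up with '-f'
        self.read_line = read_line  # injectable so the loop can be scripted

    def ask(self, message, options):
        """Return the chosen option; options[0] is the first (default) choice."""
        if self.always_first:
            return options[0]
        while True:
            prompt = f"{message} ({'/'.join(options)}/a) "
            answer = self.read_line(prompt).strip().lower()
            if answer == "a":
                # 'a' means: take the first option now and for all later prompts.
                self.always_first = True
                return options[0]
            if answer in options:
                return answer
            # Anything else is ignored and the operator is asked again.
```

 Passing read_line as a parameter is just a design convenience so the loop 
 can be driven from a test without a terminal.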





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-28 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.034.patch

* address Todd's suggestions

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-28 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.014.patch

* fix unit test

* enable reading edit logs from XML 

* add -f / -fix-txids option, which makes oev close any holes in the 
transaction ID series.

* Editing the XML no longer allows you to manually recalculate checksums for 
the edited opcode.
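
One way to picture "closing holes in the transaction ID series" is simple 
renumbering. The Python sketch below is an assumption about what such an 
option could do, not a description of how oev implements -fix-txids; the 
fix_txids function and the dict layout of each op are hypothetical.

```python
def fix_txids(ops):
    """Return copies of ops renumbered so that txids are consecutive.

    Numbering starts at the first op's txid; the input list is not modified.
    """
    if not ops:
        return []
    next_txid = ops[0]["txid"]
    fixed = []
    for op in ops:
        fixed.append({**op, "txid": next_txid})
        next_txid += 1
    return fixed
```

For example, ops with txids 5, 6, 9 would come back numbered 5, 6, 7, with 
their order and other fields unchanged.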

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch, HDFS-3050.014.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.





[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-26 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.012.patch

* fix findbugs warnings

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch, HDFS-3050.012.patch







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-26 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.033.patch

* rebase on trunk

* fix some resource leaks in TestNameNodeRecovery

* some whitespace and logging cleanups

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-23 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.032.patch

* hdfs_user_guide.xml: fix paragraph breaks

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3055) Implement recovery mode for branch-1

2012-03-23 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3055:
---

Attachment: HDFS-3055-b1.001.patch

* initial version

 Implement recovery mode for branch-1
 

 Key: HDFS-3055
 URL: https://issues.apache.org/jira/browse/HDFS-3055
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Fix For: 1.0.0

 Attachments: HDFS-3055-b1.001.patch


 Implement recovery mode for branch-1





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.027.patch

* fix bug uncovered by jenkins

* add a little more debug in TestNameNodeRecovery

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.009.patch

rebase

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.009.patch

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: (was: HDFS-3050.009.patch)

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.010.patch

remove all changes to common

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch







[jira] [Updated] (HDFS-3129) NetworkTopology: add test that getLeaf should check for invalid topologies

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3129:
---

Attachment: HDFS-3129.001.patch

 NetworkTopology: add test that getLeaf should check for invalid topologies
 --

 Key: HDFS-3129
 URL: https://issues.apache.org/jira/browse/HDFS-3129
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3129.001.patch








[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.011.patch

* fix broken import lines

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch, HDFS-3050.009.patch, HDFS-3050.010.patch, 
 HDFS-3050.011.patch







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.029.patch

* rebase on trunk

* fix findbugs suppression

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, 
 HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this situation.  In a 
 perfect world, we never would.  However, bad data on disk can occur from time 
 to time because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and can result in 
 downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.  The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or by typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.
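
 The prompting behaviour described above could look roughly like the sketch 
 below.  This is only an illustration, not the code from the patch: the class 
 RecoveryPrompt, its method names, and the option strings are hypothetical.  
 The only details taken from the description are that the operator is asked 
 what to do, and that the first option can be taken for all prompts, either 
 from startup or after typing 'a' at a prompt.

```java
import java.util.List;
import java.util.Scanner;

/** Hypothetical sketch of an interactive recovery prompt; not the patch's code. */
class RecoveryPrompt {
    private final Scanner in;
    private boolean alwaysChooseFirst;

    /** alwaysChooseFirst models starting up with the "take the first option" flag. */
    RecoveryPrompt(Scanner in, boolean alwaysChooseFirst) {
        this.in = in;
        this.alwaysChooseFirst = alwaysChooseFirst;
    }

    /**
     * Asks the operator to pick one of the options and returns the choice.
     * Typing "a" takes the first option now and for every later prompt.
     */
    String ask(String problem, List<String> options) {
        if (options.isEmpty()) {
            throw new IllegalArgumentException("at least one option is required");
        }
        if (alwaysChooseFirst) {
            return options.get(0);
        }
        while (true) {
            System.out.println(problem + "  Enter one of " + options
                + ", or 'a' to always take the first option:");
            if (!in.hasNextLine()) {
                throw new IllegalStateException("input closed before a choice was made");
            }
            String answer = in.nextLine().trim();
            if (answer.equals("a")) {
                alwaysChooseFirst = true;   // remember the decision for later prompts
                return options.get(0);
            }
            if (options.contains(answer)) {
                return answer;
            }
            System.out.println("Unrecognized answer: " + answer);
        }
    }
}
```

 Keeping the "always" decision inside the prompt object means the code that 
 detects an inconsistency does not need to know whether the run is interactive.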





[jira] [Updated] (HDFS-3129) NetworkTopology: add test that getLeaf should check for invalid topologies

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3129:
---

Attachment: HDFS-3129-b1.001.patch

* branch-1 version

 NetworkTopology: add test that getLeaf should check for invalid topologies
 --

 Key: HDFS-3129
 URL: https://issues.apache.org/jira/browse/HDFS-3129
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3129-b1.001.patch, HDFS-3129.001.patch








[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.030.patch

fix bug in prompting

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-22 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.031.patch

testing done: manually corrupted an edit log, recovered it with recovery mode.

patch update: whitespace tweak for recovery prompt

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-21 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.024.patch

* HdfsServerConstants: use equals for string equality

* fix bugs with the upgrade process
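
The string-equality fix above refers to a common Java pitfall: `==` on two String references compares object identity, not contents, so it can fail for a string built at run time (for example, a command-line argument). A minimal sketch follows; the class `StartupOptionDemo`, the enum, and the option text are hypothetical and are not taken from HdfsServerConstants.

```java
/** Hypothetical option parser illustrating equals() versus ==; names are illustrative only. */
class StartupOptionDemo {
    enum Option { REGULAR, RECOVER }

    /** Maps a command-line argument to an option by comparing string contents. */
    static Option parse(String arg) {
        // "literal".equals(arg) compares contents and is also safe when arg is null.
        // Writing arg == "-recover" would compare references and can be false
        // even when the characters are identical.
        if ("-recover".equals(arg)) {
            return Option.RECOVER;
        }
        return Option.REGULAR;
    }
}
```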

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-21 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.026.patch

rebase on latest trunk

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.020.patch

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Status: Open  (was: Patch Available)

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Status: Patch Available  (was: Open)

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.007.patch

rebase on latest trunk

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch


 Currently, OEV (the offline edits viewer) re-implements all of the opcode 
 parsing logic found in the NameNode.  This duplicated code creates a 
 maintenance burden for us.
 OEV should be refactored to simply use the normal EditLog parsing code, 
 rather than rolling its own.





[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: (was: HDFS-3050.008.patch)

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch







[jira] [Updated] (HDFS-3050) refactor OEV to share more code with the NameNode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3050:
---

Attachment: HDFS-3050.008.patch

 refactor OEV to share more code with the NameNode
 -

 Key: HDFS-3050
 URL: https://issues.apache.org/jira/browse/HDFS-3050
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-3050.006.patch, HDFS-3050.007.patch, 
 HDFS-3050.008.patch







[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: HDFS-3044.003.patch

address Eli's comments

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3044.002.patch, HDFS-3044.003.patch


 The fsck move behavior in the code and originally articulated in HADOOP-101 
 is:
 {quote}Current failure modes for DFS involve blocks that are completely 
 missing. The only way to fix them would be to recover chains of blocks and 
 put them into lost+found{quote}
 A directory is created with the file name, the blocks that are accessible are 
 created as individual files in this directory, then the original file is 
 removed. 
 I suspect the rationale for this behavior was that you can't use files that 
 are missing locations, and copying the blocks as files at least makes part of 
 the files accessible. However, this behavior can also result in permanent 
 data loss. E.g.:
 - Some datanodes don't come up (e.g. due to hardware issues) and check in on 
 cluster startup; files with blocks where all replicas are on this set of 
 datanodes are marked corrupt
 - Admin does fsck move, which deletes the corrupt files and saves whatever 
 blocks were available
 - The hardware issues with the datanodes are resolved, and they are started 
 and join the cluster. The NN tells them to delete their blocks for the 
 corrupt files since the files were deleted. 
 I think we should:
 - Make fsck move non-destructive by default (e.g. it just does a move into 
 lost+found)
 - Make the destructive behavior optional (e.g. --destructive, so admins think 
 about what they're doing)
 - Provide better sanity checks and warnings, e.g. if you're running fsck and 
 not all the slaves have checked in (if using dfs.hosts), then fsck should 
 print a warning indicating this, which an admin should have to override if 
 they want to do something destructive
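
 The three proposals above can be read as one small decision rule.  The sketch 
 below is only an illustration of that rule; the class FsckMovePolicy, the 
 enum values, and the parameter names are hypothetical and do not come from 
 the fsck code or from the attached patches.

```java
/** Hypothetical decision logic for "fsck move"; names are illustrative, not from the patch. */
class FsckMovePolicy {
    enum Action { KEEP_ORIGINAL, DELETE_ORIGINAL, REFUSE }

    /**
     * @param destructiveRequested  the admin explicitly asked for the destructive behavior
     * @param allDatanodesCheckedIn every expected datanode has reported to the NameNode
     * @param operatorOverride      the admin acknowledged the "not all checked in" warning
     */
    static Action decide(boolean destructiveRequested,
                         boolean allDatanodesCheckedIn,
                         boolean operatorOverride) {
        if (!destructiveRequested) {
            // Default: salvage what is readable into lost+found, leave the original alone.
            return Action.KEEP_ORIGINAL;
        }
        if (!allDatanodesCheckedIn && !operatorOverride) {
            // Missing datanodes may still hold the "lost" replicas, so refuse to delete.
            return Action.REFUSE;
        }
        return Action.DELETE_ORIGINAL;
    }
}
```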





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.022.patch

* remove some unnecessary whitespace changes

* re-introduce EditLogInputException

* edit log input stream: change API as we discussed.

* FSEditLogLoader: re-organize this file.  Fix some corner cases relating to 
out-of-order transaction IDs

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-20 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.023.patch

* OpInstanceCache needs to be thread-local to work correctly

* update exception text regex in TestFSEditLogLoader
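
A cache of reusable, mutable objects shared between threads lets one thread overwrite an object another thread is still reading, which is presumably why the cache needs to be thread-local. The sketch below shows the general per-thread cache technique using `java.lang.ThreadLocal`; the classes `ReusableOp` and `PerThreadOpCache` are hypothetical stand-ins and do not reproduce the real OpInstanceCache.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mutable operation object that is reused between reads. */
class ReusableOp {
    String payload;
}

/** Hypothetical per-thread cache: each thread gets its own map of reusable instances. */
class PerThreadOpCache {
    private static final ThreadLocal<Map<String, ReusableOp>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    /** Returns this thread's instance for opName, creating it on first use. */
    static ReusableOp get(String opName) {
        return CACHE.get().computeIfAbsent(opName, k -> new ReusableOp());
    }
}
```

Within one thread, repeated calls return the same instance, so the allocation savings are kept; two threads never see each other's instances.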

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-17 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.019.patch

* remove obsolete references to JournalStream in comments

* rename resync to skipBrokenEdits

* rename -f to -chooseFirst

* fix EditLogInputStream comments

* ELIS::rewind -> ELIS::putOp

* FSEditLogLoader: fix a case where the numEdits return could be incorrect

* FSEditLogLoader: improve handling of missing transactions

* fix some cases in which we had been assuming that there are no gaps in the 
transaction log stream.  Try to avoid doing fancy arithmetic by counting the 
number of edits we've decoded, etc.  Instead, just rely on looking at the 
transaction ID of the last edit we decoded.
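
To illustrate the last point: if the log has a gap, computing the last transaction ID from the number of edits decoded gives the wrong answer, while reading the ID off the last decoded edit does not. The sketch below contrasts the two approaches; the class `EditLogProgress` and its method names are hypothetical and are not taken from FSEditLogLoader.

```java
/** Hypothetical helper contrasting two ways to find the last applied transaction ID. */
class EditLogProgress {
    /** Fragile: only correct when the decoded transaction IDs are contiguous. */
    static long lastTxidByCounting(long firstTxid, long[] decodedTxids) {
        return firstTxid + decodedTxids.length - 1;
    }

    /** Robust: uses the ID of the last edit that was actually decoded. */
    static long lastTxidByInspection(long firstTxid, long[] decodedTxids) {
        if (decodedTxids.length == 0) {
            return firstTxid - 1;   // nothing was decoded, so nothing was applied
        }
        return decodedTxids[decodedTxids.length - 1];
    }
}
```

For a stream that starts at transaction 1 and contains IDs 1, 2, 5, 6 (transactions 3 and 4 are missing), counting reports 4 while inspection reports 6.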

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004__namenode_recovery_tool.txt


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get in this case.  In a perfect 
 world, we never would.  However, bad data on disk can happen from time to 
 time, because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and which can result in 
 downtime.
 Recovery mode is initialized by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.
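
The interactive behaviour described above could be sketched roughly as follows. This is a hedged illustration, not the tool's implementation: the class name, the `ask` method, and the option strings are all hypothetical. It assumes that no option is itself named "a", since 'a' is reserved for "always take the first option".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of an interactive recovery prompt; not the patch's code. */
public class RecoveryPromptSketch {
    private final BufferedReader in;
    private boolean alwaysChooseFirst; // set by a startup flag or by typing 'a'

    public RecoveryPromptSketch(BufferedReader in, boolean alwaysChooseFirst) {
        this.in = in;
        this.alwaysChooseFirst = alwaysChooseFirst;
    }

    /** Asks the operator to pick one of the options and returns the choice. */
    public String ask(String question, String... options) {
        if (options.length == 0) {
            throw new IllegalArgumentException("at least one option is required");
        }
        while (true) {
            if (alwaysChooseFirst) {
                return options[0];
            }
            System.out.println(question + " [" + String.join("/", options)
                + ", a = always first]");
            String line;
            try {
                line = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) {
                throw new IllegalStateException("input ended before a choice was made");
            }
            String answer = line.trim();
            if (answer.equals("a")) {
                alwaysChooseFirst = true; // loop back and return options[0]
                continue;
            }
            for (String option : options) {
                if (option.equals(answer)) {
                    return option;
                }
            }
            // Unrecognized input: fall through and ask again.
        }
    }

    public static void main(String[] args) {
        BufferedReader scripted = new BufferedReader(new StringReader("skip\na\n"));
        RecoveryPromptSketch prompt = new RecoveryPromptSketch(scripted, false);
        System.out.println(prompt.ask("Bad opcode found.", "stop", "skip"));
        System.out.println(prompt.ask("Missing transaction.", "stop", "skip"));
        System.out.println(prompt.ask("Checksum mismatch.", "continue", "quit"));
    }
}
```

Once 'a' has been typed, later prompts return their first option without reading any further input, which matches the effect of starting with the choose-first flag.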





[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-16 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.016.patch

* rebase on latest trunk

* make nextOp protected in all subclasses of ELIS
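
As a hedged illustration of why a decoding hook might be made protected: if callers can only reach it through one public method, the base class can add behaviour such as a one-op push-back in a single place. The sketch below uses hypothetical names (`OpStream`, `ListOpStream`, `putBack`) and strings in place of real edit log ops; it is not the ELIS code.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical sketch; not the EditLogInputStream code. */
public class NextOpSketch {
    public abstract static class OpStream {
        private String pushedBack; // at most one op held for re-reading

        /** The only public way to obtain the next op. */
        public final String readOp() {
            if (pushedBack != null) {
                String op = pushedBack;
                pushedBack = null;
                return op;
            }
            return nextOp();
        }

        /** Returns one op to the stream so the next readOp() sees it again. */
        public final void putBack(String op) {
            if (pushedBack != null) {
                throw new IllegalStateException("only one op can be put back");
            }
            pushedBack = op;
        }

        /** Decoding hook for subclasses; returns null at end of stream. */
        protected abstract String nextOp();
    }

    /** Trivial subclass that serves ops from an in-memory list. */
    public static class ListOpStream extends OpStream {
        private final Deque<String> ops;

        public ListOpStream(String... ops) {
            this.ops = new ArrayDeque<>(Arrays.asList(ops));
        }

        @Override
        protected String nextOp() {
            return ops.poll();
        }
    }
}
```

Because `nextOp` is protected, code outside the class hierarchy cannot bypass the push-back slot by calling the decoder directly.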

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-16 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.017.patch

add a section about recovery to hdfs_user_guide.xml

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-16 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.018.patch

* move to using a per-Reader or per-FSEditLog operation cache, rather than a 
purely per-thread operation cache.  Now that we are caching a single opcode 
inside the FSInputStream, we definitely don't want these instances shared 
between threads.
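
The hazard can be sketched as follows, under the assumption that op objects are mutable and reused rather than allocated per edit. All names here (`PerReaderOpCache`, `Reader`, `Op`) are hypothetical. With one cache per reader, an op that one reader is still holding cannot be overwritten by another reader, even when both run on the same thread.

```java
import java.util.EnumMap;

/** Hypothetical sketch of a per-reader cache of reusable op objects. */
public class OpCacheSketch {
    public enum OpCode { ADD, DELETE }

    /** Mutable op object that is reused across decode calls. */
    public static final class Op {
        public final OpCode code;
        public long txid;

        Op(OpCode code) {
            this.code = code;
        }
    }

    /** Holds one reusable Op per opcode; owned by exactly one reader. */
    public static final class PerReaderOpCache {
        private final EnumMap<OpCode, Op> ops = new EnumMap<>(OpCode.class);

        public Op get(OpCode code) {
            return ops.computeIfAbsent(code, Op::new);
        }
    }

    public static final class Reader {
        private final PerReaderOpCache cache = new PerReaderOpCache();

        /** "Decodes" an op by filling in the reader's reusable instance. */
        public Op decode(OpCode code, long txid) {
            Op op = cache.get(code);
            op.txid = txid;
            return op;
        }
    }

    public static void main(String[] args) {
        Reader first = new Reader();
        Reader second = new Reader();
        Op held = first.decode(OpCode.ADD, 1L);
        second.decode(OpCode.ADD, 2L);  // does not touch the held op
        System.out.println(held.txid);  // 1
    }
}
```

Had both readers shared a single per-thread cache, the second `decode` call would have overwritten the `txid` of the op the first reader was still holding.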

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-15 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.012.patch

* make more exceptions skippable

* rename StartupOption.ALWAYS_CHOOSE_YES to StartupOption.ALWAYS_CHOOSE_FIRST, 
to better reflect what it does.

* refactor EditLogInputStream a bit


 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-15 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.013.patch

* remove SkippableEditLogException, as it turned out not to be necessary

* test skipping in EditLogInputStream

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-03-15 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: HDFS-3004.015.patch

fix small bug in opcode skipping

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004__namenode_recovery_tool.txt







[jira] [Updated] (HDFS-3044) fsck move should be non-destructive by default

2012-03-12 Thread Colin Patrick McCabe (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3044:
---

Attachment: (was: HDFS-3044.001.patch)

 fsck move should be non-destructive by default
 --

 Key: HDFS-3044
 URL: https://issues.apache.org/jira/browse/HDFS-3044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Eli Collins
Assignee: Colin Patrick McCabe
 Attachments: HDFS-3044.002.patch


 The fsck move behavior in the code and originally articulated in HADOOP-101 
 is:
 {quote}Current failure modes for DFS involve blocks that are completely 
 missing. The only way to fix them would be to recover chains of blocks and 
 put them into lost+found{quote}
 A directory is created with the file name, the blocks that are accessible are 
 created as individual files in this directory, then the original file is 
 removed. 
 I suspect the rationale for this behavior was that you can't use files that 
 are missing locations, and copying the blocks as files at least makes part of 
 the files accessible. However, this behavior can also result in permanent 
 data loss. E.g.:
 - Some datanodes don't come up (e.g. due to a HW issue) and check in on cluster 
 startup; files with blocks where all replicas are on this set of datanodes 
 are marked corrupt
 - Admin does fsck move, which deletes the corrupt files and saves whatever 
 blocks were available
 - The HW issues with the datanodes are resolved; they are started and join the 
 cluster. The NN tells them to delete their blocks for the corrupt files since 
 the files were deleted. 
 I think we should:
 - Make fsck move non-destructive by default (e.g. just do a move into 
 lost+found)
 - Make the destructive behavior optional (e.g. --destructive, so admins think 
 about what they're doing)
 - Provide better sanity checks and warnings, e.g. if you're running fsck and 
 not all the slaves have checked in (if using dfs.hosts), then fsck should 
 print a warning indicating this, which an admin should have to override if they 
 want to do something destructive
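
The proposal above could be sketched as a small planning function. This is an illustration only, not fsck's code. The class, the method, its parameters, and the action strings are hypothetical, and it assumes that "non-destructive" means salvaging the accessible blocks into lost+found while leaving the original file in place.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed fsck move policy; not fsck's code. */
public class FsckMoveSketch {
    /**
     * Returns the actions fsck would take for one corrupt file.
     *
     * @param destructive            whether the admin asked for the original to be deleted
     * @param allDatanodesCheckedIn  whether every expected datanode has reported in
     * @param overrideWarning        whether the admin explicitly accepted the warning
     */
    public static List<String> planMove(String path, boolean destructive,
                                        boolean allDatanodesCheckedIn,
                                        boolean overrideWarning) {
        if (destructive && !allDatanodesCheckedIn && !overrideWarning) {
            throw new IllegalStateException(
                "not all datanodes have checked in; refusing destructive move of " + path);
        }
        List<String> actions = new ArrayList<>();
        // Default behaviour: salvage what is readable, keep the original.
        actions.add("salvage " + path + " -> /lost+found" + path);
        if (destructive) {
            actions.add("delete " + path);
        }
        return actions;
    }

    public static void main(String[] args) {
        System.out.println(planMove("/user/a/file", false, false, false));
        System.out.println(planMove("/user/a/file", true, true, false));
    }
}
```

The deletion step is only planned when it is explicitly requested, and it is refused outright while datanodes are missing unless the admin overrides the warning.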




