Re: critical name node problem
Allen Wittenauer wrote:
> On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
>> Another idea would be a tool or namenode startup mode that would make it
>> ignore EOFExceptions to recover as much of the edits as possible.
>
> We clearly need to change the how-to-configure docs to make sure people put
> at least two directories, on two different storage systems, for dfs.name.dir.
> This problem seems to happen quite often, and having two+ dirs helps protect
> against it. We recently had a disk holding one of our copies go bad; the
> system kept going just fine until we had a chance to reconfigure the name
> node. That said, I've just filed HADOOP-4080 to help alert admins in these
> situations.

...that, and HADOOP-4081. Apache Axis has a production/development switch; in development mode it sends stack traces over the wire and is generally more forgiving. By default it assumes you are in production rather than development, so you have to explicitly flip the switch to get slightly reduced security. Hadoop could have something similar: if the production flag is set, the cluster would simply refuse to come up if it felt the configuration wasn't robust enough.
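A pre-flight check of that sort could be sketched in shell. This is purely illustrative: the real switch would have to live inside the NameNode itself, and the config file written here is generated only so the demo is self-contained (the paths in it are made up).

```shell
# Hypothetical "production mode" pre-flight check, sketched as a shell
# script: refuse to start unless dfs.name.dir lists at least two
# directories. The sample hadoop-site.xml is created here only to make
# the demo runnable; its paths are invented.
cat > hadoop-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/disk1/dfs/name,/mnt/disk2/dfs/name</value>
  </property>
</configuration>
EOF

# Pull out the comma-separated dfs.name.dir value and count its entries.
DIRS=$(sed -n 's:.*<value>\(.*\)</value>.*:\1:p' hadoop-site.xml)
COUNT=$(echo "$DIRS" | tr ',' '\n' | grep -c .)

if [ "$COUNT" -ge 2 ]; then
    echo "dfs.name.dir looks robust ($COUNT directories)"
else
    echo "production mode: refusing to start with only $COUNT dfs.name.dir entry" >&2
fi
# prints: dfs.name.dir looks robust (2 directories)
```

A real implementation would of course parse the configuration through Hadoop's own Configuration class rather than sed, but the policy is the same: fail loudly at startup instead of failing silently at recovery time.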
critical name node problem
Hi!

My namenode has run out of space, and now I'm getting the following:

08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
        at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
        at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
        at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
        at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
        at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196

This is hadoop-0.17.1, btw. What do I do now?

Andreas
Re: critical name node problem
OK, after googling around a bit, the solution seems to be to either delete the edits file, which in my case would be non-cool (24MB worth of edits in there), or truncate it correctly. So I used the following script to figure out how much data needs to be dropped:

LEN=25497570
while true
do
    dd if=edits.org of=edits bs=$LEN count=1
    time hadoop namenode
    if [[ $? -ne 255 ]]
    then
        echo $LEN seems to have worked.
        exit 0
    fi
    LEN=$(expr $LEN - 1)
done

Guess something like this might make sense to add to http://wiki.apache.org/hadoop/TroubleShooting - not everyone will be able to figure out how to get rid of the last incomplete record. Another idea would be a tool or namenode startup mode that would make it ignore EOFExceptions to recover as much of the edits as possible.

Andreas

On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote:
> Hi! My namenode has run out of space, and now I'm getting the following:
>
> 08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
> 08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
> 08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> [...]
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
> 08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
> SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196
>
> hadoop-0.17.1 btw. What do I do now?
>
> Andreas
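The dd invocation in that loop just copies the first LEN bytes of the input file, which is what makes the byte-exact truncation work. A minimal, self-contained illustration on a scratch file (the file names and contents here are invented for the demo; always work on a copy of the real edits file, never the original):

```shell
# Demonstrate byte-exact truncation with dd, as used in the loop above,
# on a throwaway file. Pretend the tail is a torn, half-written record.
printf 'complete-record|torn-rec' > edits.org
LEN=16                                        # length of the last known-good prefix
dd if=edits.org of=edits bs=$LEN count=1 2>/dev/null
wc -c < edits                                 # 16 bytes survive the truncation
```

Decrementing LEN by one byte per attempt is O(n) namenode restarts in the worst case; if the torn record is large, a bisection over LEN would converge much faster, but the linear scan is simpler when you expect only a few trailing bytes to be bad.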
Re: critical name node problem
On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
> Another idea would be a tool or namenode startup mode that would make it
> ignore EOFExceptions to recover as much of the edits as possible.

We clearly need to change the how-to-configure docs to make sure people put at least two directories, on two different storage systems, for dfs.name.dir. This problem seems to happen quite often, and having two+ dirs helps protect against it. We recently had a disk holding one of our copies go bad; the system kept going just fine until we had a chance to reconfigure the name node.

That said, I've just filed HADOOP-4080 to help alert admins in these situations.
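For reference, a dfs.name.dir setting with two directories in hadoop-site.xml would look something like this (the paths are examples only; the point is that the two entries sit on different physical storage):

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/disk1/dfs/name,/mnt/disk2/dfs/name</value>
</property>
```

The namenode writes the fsimage and edits to every listed directory, so a full or failed disk on one of them leaves an intact copy on the other.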