On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> Hello Robert,
>
> It seems that your edit logs and fsimage have become corrupted somehow.
> It looks somewhat similar to this one:
> https://issues.apache.org/jira/browse/HDFS-686

Similar, but the trace is different.

> Have you made any changes to the 'dfs.name.dir' directory lately?

No.

> Do you have enough space where metadata is getting stored?

Yes. All 3 locations have plenty of space (hundreds of GB).

> You can make use of the offline image viewer to diagnose the fsimage file.
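I will give the image viewer a try. For anyone else following along: if I have the invocation right, it can be pointed at a copy of the image file like so (the input path below is a placeholder, and I am assuming the oiv tool is present in the 1.0.4 build):

    bin/hadoop oiv -i /backup/copy-of-fsimage -o fsimage-dump.txt -p Indented

That should write a human-readable dump of the namespace to fsimage-dump.txt; leaving off "-p Indented" should give an ls-style listing from the default Ls processor.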
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <psyb...@gmail.com> wrote:
>
>> It just happened again. This was after a fresh format of HDFS/HBase,
>> and I am attempting to re-import the (backed-up) data.
>>
>> http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose the data from the past
>> 3 hours.
>>
>> What is causing this? How can I avoid it in the future? Is there an
>> easy way to monitor the checkpoints (other than a script grepping the
>> logs) to see when this happens?
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <psyb...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <psyb...@gmail.com> wrote:
>>>
>>>> I am at a bit of a wit's end here. Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know. I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS), and all 3 of
>>>> these dirs have exactly the same files in them.
>>>>
>>>> I also run a secondary checkpoint node. That one appears to have
>>>> started failing a week ago, so checkpoints were *not* being done
>>>> since then. Thus I can get the NN up and running, but with week-old
>>>> data!
>>>>
>>>> What is going on here? Why does my NN data *always* wind up causing
>>>> this exception over time? Is there some easy way to get notified
>>>> when the checkpointing starts to fail?

--

Robert Dyer
rd...@iastate.edu
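P.S. For the archives, here is the sort of cron check I am planning to use so this does not go unnoticed again. It rests on one assumption: on Hadoop 1.x, a successful checkpoint rewrites ${dfs.name.dir}/current/fsimage on the namenode, so a stale mtime on that file means checkpointing has stopped. The directory paths and the age threshold below are placeholders for my setup.

#!/usr/bin/env python
# checkpoint_watch.py - a rough sketch, not battle-tested.
# Alerts when the namenode's fsimage has not been rewritten recently,
# which (on Hadoop 1.x) suggests the secondary has stopped checkpointing.

import os
import sys
import time

# The dfs.name.dir locations (placeholder paths for my 2-local + 1-NFS layout).
NAME_DIRS = [
    "/data1/hdfs/name",
    "/data2/hdfs/name",
    "/mnt/nfs/hdfs/name",
]
MAX_AGE_HOURS = 2.0  # alert if no checkpoint for this long

def fsimage_age_hours(name_dir):
    """Hours since the fsimage in this directory was last rewritten."""
    path = os.path.join(name_dir, "current", "fsimage")
    return (time.time() - os.path.getmtime(path)) / 3600.0

def main():
    problems = []
    for d in NAME_DIRS:
        try:
            age = fsimage_age_hours(d)
        except OSError as e:
            problems.append("%s: cannot stat fsimage (%s)" % (d, e))
            continue
        if age > MAX_AGE_HOURS:
            problems.append("%s: last checkpoint %.1f hours ago" % (d, age))
    if problems:
        print("WARNING: HDFS checkpointing looks stale:")
        for p in problems:
            print("  " + p)
        sys.exit(1)  # nonzero exit so wrappers/monitors can see the failure

if __name__ == "__main__":
    main()

Run it from cron every few minutes; cron's MAILTO will mail the WARNING output, and a Nagios-style check could key off the exit code instead.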