I have reverted my test cluster to its pre-patch state, and everything now works normally.
Thanks.

2011/2/15 Jameson Li <hovlj...@gmail.com>:
> Hi Todd,
> Thanks very much. I think you are right.
> I had used the hadoop-0.20-append patches that are mentioned here:
> http://github.com/lenn0x/Hadoop-Append
>
> After reading the patch 0002-HDFS-278.patch, I found that the file
> "src/hdfs/org/apache/hadoop/hdfs/DFSClient.java" in my cluster does not
> contain these lines:
>
>     this.maxBlockAcquireFailures =
>         conf.getInt("dfs.client.max.block.acquire.failures",
>                     MAX_BLOCK_ACQUIRE_FAILURES);
>
> It just looks like this:
>
>     this.maxBlockAcquireFailures = getMaxBlockAcquireFailures(conf);
>
> So I changed 0002-HDFS-278.patch, and the diff between the original
> 0002-HDFS-278.patch and the new patch after my change is:
>
> diff 0002-HDFS-278.patch ../hadoop-new/patch-origion/0002-HDFS-278.patch
> 0a1,10
>> From 56463073cf051f1e11b4d3921542979e53daead4 Mon Sep 17 00:00:00 2001
>> From: Chris Goffinet <c...@chrisgoffinet.com>
>> Date: Mon, 20 Jul 2009 17:20:13 -0700
>> Subject: [PATCH 2/4] HDFS-278
>>
>> ---
>>  src/hdfs/org/apache/hadoop/hdfs/DFSClient.java |   70 ++++++++++++++++++++++--
>>  1 files changed, 64 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java b/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
> 2,3c12,13
> < --- src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
> < +++ src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
> ---
>> --- a/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
>> +++ b/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
> 19,20c29,32
> < @@ -188,5 +192,7 @@ public class DFSClient implements FSConstants, java.io.Closeable {
> <     this.maxBlockAcquireFailures = getMaxBlockAcquireFailures(conf);
> ---
>> @@ -167,7 +171,9 @@ public class DFSClient implements FSConstants, java.io.Closeable {
>>     this.maxBlockAcquireFailures =
>>                         conf.getInt("dfs.client.max.block.acquire.failures",
>>                                     MAX_BLOCK_ACQUIRE_FAILURES);
> 118a131,133
>> --
>> 1.6.3.1
>>
>
> Did I miss some of the patches about
> hadoop-0.20-append?
> How could I recover my NN and make it work so that I can export the data?
>
> 2011/2/14 Todd Lipcon <t...@cloudera.com>
>>
>> Hi Jameson,
>>
>> My first instinct is that you have an incomplete patch series for hdfs
>> append, and that's what caused your problem. There were many bug fixes
>> along the way for hadoop-0.20-append, and maybe you've missed some in
>> your manually patched build.
>>
>> -Todd
>>
>> On Mon, Feb 14, 2011 at 5:49 AM, Jameson Li <hovlj...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> My hadoop version is based on the hadoop 0.20.2 release, patched with
>>> HADOOP-4675, 5745, MAPREDUCE-1070, 551, 1089 (to support ganglia 3.1,
>>> fair scheduler preemption, and hdfs append), and patched with
>>> HADOOP-6099, HDFS-278, Patches-from-Dhruba-Borthakur, HDFS-200 (to
>>> support scribe).
>>>
>>> Last Friday I found that the clocks on some of my test hadoop cluster
>>> nodes were wrong; they were some number of hours ahead of the correct
>>> time. So I ran the following command and added it to the crontab:
>>>
>>> /usr/bin/rdate -s time-b.nist.gov
>>>
>>> Then my hadoop cluster namenode crashed after I restarted it, and I
>>> don't know whether this is related to changing the time.
>>>
>>> The error log:
>>>
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of blocks = 196
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid blocks = 0
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of under-replicated blocks = 29
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of over-replicated blocks = 41
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 69 secs.
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF.
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 1 racks and 5 datanodes
>>> 2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 29 blocks
>>> 2011-02-12 18:44:46,886 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.14:50010 to replicate blk_-8806907658071633346_1750 to datanode(s) 192.168.1.83:50010
>>> 2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.83:50010 to replicate blk_-7689075547598626554_1800 to datanode(s) 192.168.1.10:50010
>>> 2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.84:50010 to replicate blk_-7587424527299099175_1717 to datanode(s) 192.168.1.10:50010
>>> 2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.84:50010 to replicate blk_-6925943363757944243_1909 to datanode(s) 192.168.1.13:50010
>>> 2011-02-12 18:44:46,888 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.14:50010 to replicate blk_-6835423500788375545_1928 to datanode(s) 192.168.1.10:50010
>>> 2011-02-12 18:44:46,888 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.83:50010 to replicate blk_-6477488774631498652_1742 to datanode(s) 192.168.1.84:50010
>>> 2011-02-12 18:44:46,889 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception.
>>> java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
>>> java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
>>>         at org.apache.hadoop.hdfs.protocol.Block.validateGenerationStamp(Block.java:148)
>>>         at org.apache.hadoop.hdfs.protocol.Block.compareTo(Block.java:156)
>>>         at org.apache.hadoop.hdfs.protocol.Block.compareTo(Block.java:30)
>>>         at java.util.TreeMap.put(TreeMap.java:545)
>>>         at java.util.TreeSet.add(TreeSet.java:238)
>>>         at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.addBlocksToBeInvalidated(DatanodeDescriptor.java:284)
>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.invalidateWorkForOneNode(FSNamesystem.java:2743)
>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeInvalidateWork(FSNamesystem.java:2419)
>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeDatanodeWork(FSNamesystem.java:2412)
>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:2357)
>>>         at java.lang.Thread.run(Thread.java:619)
>>> 2011-02-12 18:44:46,892 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>> /************************************************************
>>> SHUTDOWN_MSG: Shutting down NameNode at hadoop5/192.168.1.84
>>> ************************************************************/
>>>
>>> Thanks,
>>> Jameson
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
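[Editor's note] The IllegalStateException in the trace above comes from Block.compareTo() refusing to order a block whose generation stamp still equals the wildcard value (1); since TreeMap/TreeSet call compareTo() on insertion, one such block kills the ReplicationMonitor's TreeSet.add(). The following is a self-contained sketch, not Hadoop's actual source — the class and constant names only mirror the stack trace, and the comparison logic is an assumption based on it:

```java
import java.util.TreeSet;

public class WildcardStampDemo {
    // In the 0.20-append era, the wildcard generation stamp was 1.
    static final long WILDCARD_STAMP = 1;

    // Hypothetical reconstruction of the ordering that the trace exercises.
    static class Block implements Comparable<Block> {
        final long blockId;
        final long generationStamp;

        Block(long blockId, long generationStamp) {
            this.blockId = blockId;
            this.generationStamp = generationStamp;
        }

        // Mirrors Block.validateGenerationStamp(Block.java:148) in the trace.
        void validateGenerationStamp() {
            if (generationStamp == WILDCARD_STAMP) {
                throw new IllegalStateException(
                    "generationStamp (=" + generationStamp + ") == WILDCARD_STAMP");
            }
        }

        @Override
        public int compareTo(Block other) {
            // Both stamps are validated before comparing, so a wildcard
            // stamp anywhere in a sorted collection aborts the insert.
            validateGenerationStamp();
            other.validateGenerationStamp();
            if (blockId != other.blockId) {
                return blockId < other.blockId ? -1 : 1;
            }
            return Long.compare(generationStamp, other.generationStamp);
        }
    }

    public static void main(String[] args) {
        // Same structure as DatanodeDescriptor.addBlocksToBeInvalidated:
        // a TreeSet of blocks scheduled for invalidation.
        TreeSet<Block> invalidateSet = new TreeSet<>();
        invalidateSet.add(new Block(-8806907658071633346L, 1750));
        try {
            // A block whose stamp was never upgraded past the wildcard
            // value makes TreeSet.add() throw, exactly as in the log.
            invalidateSet.add(new Block(42, WILDCARD_STAMP));
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This is consistent with Todd's diagnosis: a block written by an incomplete append patch series can be left with a wildcard stamp on disk, and the NameNode then crashes the first time that block enters a sorted set.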
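[Editor's note] The DFSClient discrepancy Jameson worked around earlier in the thread is a refactor, not a behavior change: the inline conf.getInt(...) call was moved into a getMaxBlockAcquireFailures(conf) helper, so both forms of the hunk initialize the same field from the same property. A stand-alone illustration under stated assumptions — Conf is a stub for Hadoop's Configuration, and the default of 3 is what 0.20.x shipped, but treat both as assumptions rather than the real API:

```java
import java.util.HashMap;
import java.util.Map;

public class MaxBlockAcquireFailuresDemo {
    // Assumed default for dfs.client.max.block.acquire.failures in 0.20.x.
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    // Minimal stand-in for Hadoop's Configuration, for illustration only.
    static class Conf {
        private final Map<String, String> props = new HashMap<>();
        void setInt(String key, int value) { props.put(key, Integer.toString(value)); }
        int getInt(String key, int defaultValue) {
            String v = props.get(key);
            return v == null ? defaultValue : Integer.parseInt(v);
        }
    }

    // The refactored form: same lookup as the inline hunk, moved into a method.
    static int getMaxBlockAcquireFailures(Conf conf) {
        return conf.getInt("dfs.client.max.block.acquire.failures",
                           MAX_BLOCK_ACQUIRE_FAILURES);
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        // The hunk the stock patch expects (inline) and the hunk Jameson's
        // tree already had (helper) read the same property.
        int inline = conf.getInt("dfs.client.max.block.acquire.failures",
                                 MAX_BLOCK_ACQUIRE_FAILURES);
        int viaHelper = getMaxBlockAcquireFailures(conf);
        System.out.println(inline == viaHelper); // true
        conf.setInt("dfs.client.max.block.acquire.failures", 10);
        System.out.println(getMaxBlockAcquireFailures(conf)); // 10
    }
}
```

Because the two forms are equivalent, editing the patch so its context matches the helper-based line (as Jameson did) is a reasonable fix for that one hunk; the risk Todd points out is everything else the rest of the append series expected to find.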