Hadoop version: hadoop-2.3.0-cdh5.1.0.

Hi, I moved the QJM off l-hbase1.dba.dev.cn0 to another machine, and the downtime dropped to 5 minutes. The log on l-hbase2.dba.dev.cn0 now looks like this:
{log}
2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits starting from txid 6599
2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 6797
2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 6797
2014-12-03 15:55:51,373 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 9
2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning because of pending operations
2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The configured checkpoint interval is 0 minutes. Using an interval of 1440 minutes that is used for deletion instead
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 179
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 4
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 386 msec
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:56:21,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-03 15:58:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:58:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:59:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:59:51,388 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: caught retry for allocation of a new block in /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483. Returning previously allocated block blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[]}
{log}

From 15:55:51 to 16:00:14 the log shows nothing but org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor entries. What is Hadoop doing during that window, and how can I reduce it? Five minutes is still too long!

On Dec 3, 2014, at 16:31, Harsh J <ha...@cloudera.com> wrote:

> What is your Hadoop version?
>
> On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust...@gmail.com> wrote:
>> Hi all,
>>
>> Attaching the log again!
>>
>> The failover happened at about 2014-12-03 12:01:
>>
>> On Dec 3, 2014, at 14:55, mail list <louis.hust...@gmail.com> wrote:
>>
>>> Sorry, I forgot the log; the failover was at about 2014-12-03 12:01:
>>>
>>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
>>>
>>> On Dec 3, 2014, at 14:48, mail list <louis.hust...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I deployed Hadoop on 3 machines:
>>>>
>>>> l-hbase1.dba.dev.cn0 (active NameNode and QJM)
>>>> l-hbase2.dba.dev.cn0 (standby NameNode, DataNode, and QJM)
>>>> l-hbase3.dba.dev.cn0 (DataNode and QJM)
>>>>
>>>> On top of Hadoop I deployed HBase:
>>>>
>>>> l-hbase1.dba.dev.cn0 (active HMaster)
>>>> l-hbase2.dba.dev.cn0 (standby HMaster)
>>>> l-hbase3.dba.dev.cn0 (RegionServer)
>>>>
>>>> I wrote a program that puts one row into HBase every second in a loop.
>>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline. After
>>>> that the program hung and could not write to HBase; after about 15 minutes
>>>> it could write again.
>>>>
>>>> 15 minutes for the HA failover is too long for me, and I have no idea
>>>> of the cause.
>>>>
>>>> I then checked the l-hbase2.dba.dev.cn0 namenode logs and found many
>>>> retries like the one below:
>>>>
>>>> {code}
>>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>>>> {code}
>>>>
>>>> I have a QJM on l-hbase1.dba.dev.cn0; does that matter?
>>>>
>>>> I am a newbie; any ideas will be appreciated!!
>>>
>>
>
>
> --
> Harsh J
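One way to see why retries against the firewalled machine can dominate the failover time is to put numbers on the policy the NameNode actually logged, RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS). The sketch below is not from the thread: the helper function is hypothetical, and the 20 s connect timeout is Hadoop's documented default for ipc.client.connect.timeout, which a given cluster may override. Note also that an iptables DROP rule makes each connection attempt wait out the full timeout, whereas a REJECT rule would fail it almost immediately.

```python
# Hypothetical helper (not from the thread): upper bound on how long one RPC
# to an unreachable host can stall under a fixed-sleep retry policy.
def worst_case_stall_ms(max_retries, sleep_ms, connect_timeout_ms):
    # One initial attempt plus max_retries retries; each attempt may wait out
    # the connect timeout, and the client sleeps between attempts (there is
    # no sleep after the final failure).
    attempts = max_retries + 1
    return attempts * connect_timeout_ms + max_retries * sleep_ms

# maxRetries=10 and sleepTime=1000 ms come from the logged retry policy;
# 20000 ms is the assumed default ipc.client.connect.timeout.
print(worst_case_stall_ms(10, 1000, 20000))  # 230000 ms, i.e. just under 4 minutes
```

If a bound like that roughly matches the observed stall, the knobs worth checking are the IPC client connect retry/timeout settings and the QJM timeouts listed in hdfs-default.xml for this release line; with REJECT-style failures the connect-timeout term drops out and the same policy costs only about 10 s of sleep.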