[ https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liu resolved HDFS-9293.
--------------------------
    Resolution: Duplicate

> FSEditLog's thread-local 'OpInstanceCache' retains a stale 'rpcId', which 
> may leave the standby NN too busy to communicate
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9293
>                 URL: https://issues.apache.org/jira/browse/HDFS-9293
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.2.0, 2.7.1
>            Reporter: 邓飞
>            Assignee: 邓飞
>             Fix For: 2.7.1
>
>
>   In our cluster (Hadoop 2.2.0 HA, 700+ DNs), we found the standby NN tailing 
> the edit log slowly while holding the FSNamesystem write lock, so the DNs' 
> heartbeat/block report IPC requests were blocked. This led the active NN to 
> remove stale DNs that could not send heartbeats because they were stuck 
> processing the standby NN's register command (fixed in 2.7.1).
>   Below is the standby NN stack:
> "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable 
> [0x00007f0dd1d76000]
>    java.lang.Thread.State: RUNNABLE
>       at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>       at 
> org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>       at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>       - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>       at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>       at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>       at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>       at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>       at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>       at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
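> 
> For reference, java.util.PriorityQueue.remove(Object) is a linear scan of the
> backing heap, so every eviction of an existing retry-cache entry during edit
> log replay costs O(N). A minimal standalone sketch of that cost (plain JDK
> code, not Hadoop code; the sizes are only illustrative):
> 
>     import java.util.PriorityQueue;
> 
>     public class PQRemoveCost {
>       public static void main(String[] args) {
>         PriorityQueue<Integer> queue = new PriorityQueue<>();
>         for (int i = 0; i < 200_000; i++) {
>           queue.add(i);
>         }
>         long start = System.nanoTime();
>         // remove(Object) walks the array until it finds the element,
>         // so each arbitrary removal is O(N); many of them become O(N^2).
>         for (int i = 0; i < 1_000; i++) {
>           queue.remove(Integer.valueOf(199_999 - i));
>         }
>         long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>         System.out.println("1000 arbitrary removals took " + elapsedMs + " ms");
>       }
>     }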
>    
>     When applying an edit log op, if an IPC retry-cache entry already exists 
> for that rpcId, the previous entry has to be removed from the priority queue, 
> which is O(N). An update-block op does not need to record an rpcId in the 
> edit log except for a client-requested updatePipeline, yet we found many 
> 'UpdateBlocksOp' entries carrying the same repeated rpcId.
>      
>   
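> 
> The suspected mechanism, sketched below with simplified, hypothetical classes
> (an illustration only, not the actual FSEditLog/OpInstanceCache code): op
> instances handed out by a per-thread instance cache are reused, and if the
> rpcClientId/rpcCallId fields are not reset on reuse, the rpcId of an earlier
> client-driven op leaks into later ops such as UpdateBlocksOp, so the standby
> NN keeps adding retry-cache entries for ops that should not carry one.
> 
>     import java.util.EnumMap;
>     import java.util.Map;
> 
>     /** Hypothetical, simplified model of a thread-local op-instance cache. */
>     public class OpCacheSketch {
>       enum OpType { UPDATE_BLOCKS }
> 
>       static class Op {
>         final OpType type;
>         byte[] rpcClientId = new byte[0];  // empty means "no retry-cache entry"
>         int rpcCallId = -1;                // sentinel for "no call id"
>         Op(OpType type) { this.type = type; }
>       }
> 
>       // One cached instance per op type, per thread.
>       private static final ThreadLocal<Map<OpType, Op>> CACHE =
>           ThreadLocal.withInitial(() -> new EnumMap<>(OpType.class));
> 
>       static Op get(OpType type) {
>         // Bug being illustrated: the reused instance keeps whatever rpc
>         // fields the previous caller set; nothing resets them here.
>         return CACHE.get().computeIfAbsent(type, Op::new);
>       }
> 
>       public static void main(String[] args) {
>         Op first = get(OpType.UPDATE_BLOCKS);
>         first.rpcClientId = new byte[] {1, 2, 3};  // client updatePipeline path
>         first.rpcCallId = 42;                      // records its rpcId
> 
>         // A later, unrelated block update reuses the cached instance and
>         // never touches the rpc fields at all:
>         Op second = get(OpType.UPDATE_BLOCKS);
>         System.out.println("stale rpcCallId = " + second.rpcCallId);  // prints 42
>       }
>     }
> 
> Under the sketch's assumptions, the obvious remedy is to reset the rpc fields
> whenever an instance is taken from the cache, so an op that never sets them
> cannot inherit a previous op's rpcId.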



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
