[ https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yi Liu resolved HDFS-9293.
--------------------------
    Resolution: Duplicate

> FSEditLog's 'OpInstanceCache' thread-local cache holds a dirty 'rpcId', which may leave the standby NN too busy to communicate
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-9293
>                 URL: https://issues.apache.org/jira/browse/HDFS-9293
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.2.0, 2.7.1
>            Reporter: 邓飞
>            Assignee: 邓飞
>             Fix For: 2.7.1
>
>
> In our cluster (Hadoop 2.2.0 HA, 700+ DNs), we found that the standby NN
> tails the edit log slowly and holds the FSNamesystem write lock while doing
> so, which blocks the DNs' heartbeat and block report IPC requests. As a
> result, the active NN removes DNs as stale: they cannot send heartbeats to it
> because they are blocked processing the standby NN's register command
> (fixed in 2.7.1).
> Below is the standby NN stack:
> "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
>    java.lang.Thread.State: RUNNABLE
>     at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>     at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>     at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>     - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> When an edit log op is applied and an entry for it already exists in the IPC
> retry cache, the previous entry has to be removed from the priority queue,
> which is an O(N) operation. An update-blocks op does not need to record an
> rpcId in the edit log except for a client 'updatePipeline' request, but we
> found many 'UpdateBlocksOp' records carrying a repeated rpcId.
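
To illustrate why a repeated rpcId is expensive, here is a minimal sketch of a cache that keeps its entries in a java.util.PriorityQueue ordered by expiry time. It is not the Hadoop LightWeightCache code: the class RetryCacheSketch, its Entry type, and the equalsCalls counter are hypothetical names made up for this example. The only behavior it relies on is that PriorityQueue.remove(Object) scans the queue linearly, calling equals() until it finds a match.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch, NOT the Hadoop implementation: a cache that indexes
// entries by rpcId and also keeps them in a priority queue ordered by expiry.
class RetryCacheSketch {

    /** Cache entry; counts equals() calls so the linear scan becomes visible. */
    static final class Entry implements Comparable<Entry> {
        static long equalsCalls = 0;
        final long rpcId;
        final long expiry;

        Entry(long rpcId, long expiry) {
            this.rpcId = rpcId;
            this.expiry = expiry;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(expiry, other.expiry);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Entry && ((Entry) o).rpcId == rpcId;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(rpcId);
        }
    }

    private final Map<Long, Entry> byId = new HashMap<>();
    private final PriorityQueue<Entry> byExpiry = new PriorityQueue<>();

    /** Adds an entry; a repeated rpcId first evicts the previous entry. */
    void put(long rpcId, long expiry) {
        Entry fresh = new Entry(rpcId, expiry);
        Entry previous = byId.put(rpcId, fresh);
        if (previous != null) {
            // PriorityQueue.remove(Object) is a linear scan over the queue.
            byExpiry.remove(previous);
        }
        byExpiry.offer(fresh);
    }

    int size() {
        return byExpiry.size();
    }

    public static void main(String[] args) {
        RetryCacheSketch cache = new RetryCacheSketch();
        for (long id = 0; id < 1000; id++) {
            cache.put(id, id);          // distinct ids: no removal needed
        }
        Entry.equalsCalls = 0;
        cache.put(999, 2000);           // repeated id: previous entry is scanned for
        System.out.println("equals() calls for one repeated rpcId: " + Entry.equalsCalls);
    }
}
```

With distinct rpcIds the put path never touches remove(); each repeated rpcId pays for a scan whose cost grows with the number of cached entries. In the reported case that cost is paid while the FSNamesystem write lock is held.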