[ https://issues.apache.org/jira/browse/HDFS-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225424#comment-14225424 ]

Hudson commented on HDFS-7097:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6605/])
HDFS-7097. Allow block reports to be processed during checkpointing on standby name node. (kihwal via wang) (wang: rev f43a20c529ac3f104add95b222de6580757b3763)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java


> Allow block reports to be processed during checkpointing on standby name node
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7097
>                 URL: https://issues.apache.org/jira/browse/HDFS-7097
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-7097.patch, HDFS-7097.patch, HDFS-7097.patch, 
> HDFS-7097.patch, HDFS-7097.ultimate.trunk.patch
>
>
> On a reasonably busy HDFS cluster, there is a steady stream of file creates, 
> causing data nodes to generate incremental block reports.  When a standby 
> name node is checkpointing, RPC handler threads trying to process a full or 
> incremental block report are blocked on the name system's {{fsLock}}, because 
> the checkpointer acquires the read lock on it.  This can create a serious 
> problem if the name space is large and checkpointing takes a long time.
> All available RPC handlers can be tied up very quickly. If you have 100 
> handlers, it only takes 34 file creates.  If a separate service RPC port is 
> not used, HA transition requests will have to wait in the call queue for 
> minutes. Even if a separate service RPC port is configured, heartbeats from 
> data nodes will be blocked. A standby NN with a large name space can lose all 
> data nodes after checkpointing.  The RPC calls will also be retransmitted by 
> data nodes many times, filling up the call queue and potentially causing 
> listen queue overflow.
> Since block reports do not modify any state that is being saved to 
> fsimage, I propose letting them through during checkpointing. 
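
The contention described above can be reproduced in isolation. The following is a minimal sketch, not the HDFS-7097 patch: the class name, the method name, and the separate checkpointLock are hypothetical, and the only thing borrowed from the description is that the checkpointer takes the read lock on fsLock while block report processing needs the write lock. It compares a checkpointer that holds the fsLock read lock with one that holds an unrelated lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only. All names here are hypothetical, not taken from the patch. */
public class CheckpointLockSketch {
  // Stand-in for the name system lock ({{fsLock}} in the description).
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
  // Assumed alternative: a separate lock that only guards checkpoint work.
  private final ReentrantLock checkpointLock = new ReentrantLock();

  /**
   * Simulates a block report arriving while a checkpoint is in progress.
   * Returns true if the report obtained the fsLock write lock within timeoutMs.
   */
  public boolean reportGetsThrough(boolean checkpointHoldsFsReadLock, long timeoutMs) {
    CountDownLatch checkpointStarted = new CountDownLatch(1);
    CountDownLatch finishCheckpoint = new CountDownLatch(1);

    Thread checkpointer = new Thread(() -> {
      Lock held = checkpointHoldsFsReadLock ? fsLock.readLock() : checkpointLock;
      held.lock();
      try {
        checkpointStarted.countDown();
        finishCheckpoint.await(); // stands in for a long image save
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        held.unlock();
      }
    });
    checkpointer.start();

    try {
      checkpointStarted.await();
      // RPC handler path: processing a block report needs the write lock.
      boolean acquired = fsLock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
      if (acquired) {
        fsLock.writeLock().unlock();
      }
      finishCheckpoint.countDown();
      checkpointer.join();
      return acquired;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while simulating a block report", e);
    }
  }

  public static void main(String[] args) {
    CheckpointLockSketch sketch = new CheckpointLockSketch();
    System.out.println("checkpointer holds fsLock read lock: "
        + sketch.reportGetsThrough(true, 200));
    System.out.println("checkpointer holds separate lock:    "
        + sketch.reportGetsThrough(false, 200));
  }
}
```

In the first call the simulated block report times out, because a write lock cannot be granted while another thread holds the read lock. In the second call it is granted immediately. The sketch does not model how edits that do change saved state would be kept out during the image save, so it should not be read as a description of how the committed change works.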



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
