[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979194#comment-13979194 ]
Hadoop QA commented on HADOOP-9640:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12641612/FairCallQueue-PerformanceOnCluster.pdf
  against trunk revision .

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3841//console

This message is automatically generated.

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>            Assignee: Chris Li
>              Labels: hdfs, qos, rpc
>         Attachments: FairCallQueue-PerformanceOnCluster.pdf,
>   MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf,
>   faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch,
>   faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch,
>   faircallqueue7_with_runtime_swapping.patch,
>   rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was
> overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by
> replacing the FIFO call queue with a [Fair Call
> Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
> (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was
> creating huge number of small files in the user directory. Due to the heavy
> load on NN, the JT also was unable to communicate with NN...The cluster
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo
> requests. the job had a bug that called getFileInfo for a nonexistent file in
> an endless loop). All other requests to namenode were also affected by this
> and hence all jobs slowed down. Cluster almost came to a grinding
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories
> (60k files) etc.”

--
This message was sent by Atlassian JIRA
(v6.2#6252)
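For readers skimming this issue: the general shape of a fair call queue is a set of priority sub-queues, a scheduler that assigns each incoming call a priority based on how heavily its caller has been using the service, and a multiplexer that drains the sub-queues with a weighted round-robin so that heavy users are de-prioritized without being starved. The sketch below illustrates only that shape; the class and method names, the per-100-calls demotion heuristic, and the weights are made up for illustration and are not the API or the tunables of the attached faircallqueue*.patch files (see the attached design PDFs for the real proposal).

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only -- not the attached patch. Calls are placed into
 * one of several priority levels based on how many calls the same user has
 * already made, and the levels are drained with a weighted round-robin so
 * that a flood from one user cannot starve everyone else.
 */
public class FairCallQueueSketch<E> {
  private final List<LinkedBlockingQueue<E>> levels = new ArrayList<>();
  private final Map<String, AtomicLong> callCounts = new ConcurrentHashMap<>();
  private final int[] weights;   // polls granted to each level per round-robin pass
  private int currentLevel = 0;
  private int drawsLeft;

  public FairCallQueueSketch(int numLevels, int capacityPerLevel, int[] weights) {
    if (weights.length < numLevels) {
      throw new IllegalArgumentException("need one weight per level");
    }
    for (int i = 0; i < numLevels; i++) {
      levels.add(new LinkedBlockingQueue<E>(capacityPerLevel));
    }
    this.weights = weights;
    this.drawsLeft = weights[0];
  }

  /** Toy scheduler: every 100 calls pushes the caller down one priority level. */
  private int levelFor(String user) {
    long count = callCounts.computeIfAbsent(user, u -> new AtomicLong()).incrementAndGet();
    return (int) Math.min(levels.size() - 1, count / 100);
  }

  /** Enqueue a call for a user; false means that level is full (caller should back off). */
  public boolean offer(String user, E call) {
    return levels.get(levelFor(user)).offer(call);
  }

  /** Weighted round-robin drain: high-priority levels are polled more often,
   *  but low-priority levels are never skipped entirely. Returns null if empty. */
  public synchronized E poll() {
    for (int attempts = 0; attempts < levels.size(); attempts++) {
      E call = levels.get(currentLevel).poll();
      if (call != null) {
        if (--drawsLeft <= 0) {
          advance();
        }
        return call;
      }
      advance();  // this level is empty; give the next level a turn
    }
    return null;  // all levels are empty
  }

  private void advance() {
    currentLevel = (currentLevel + 1) % levels.size();
    drawsLeft = weights[currentLevel];
  }
}
{code}

A server would construct something like this with, say, 4 levels and weights of 8/4/2/1, and have its handler threads poll() it in place of the single FIFO queue. The weighted round-robin drain is what turns this into congestion control rather than outright denial: a job stuck calling getFileInfo in an endless loop, as in the incidents quoted above, would sink to the lowest-priority level and compete mostly with itself, while lighter users keep being served from the higher-priority levels.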