[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838519#comment-13838519 ]
Andrew Wang commented on HADOOP-9640: ------------------------------------- Sorry, it looks like I read the wrong document. I'm glad you still found some of my comments useful, but I'll read the updated one too and get back to you :) > RPC Congestion Control with FairCallQueue > ----------------------------------------- > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement > Affects Versions: 3.0.0, 2.2.0 > Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)