[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844506#comment-13844506
 ] 

Suresh Srinivas commented on HADOOP-9640:
-----------------------------------------

I had an in-person meeting with [~chrili] on this. This is excellent work!

bq. Parsing the MapReduce job name out of the DFSClient name is kind of an ugly 
hack. The client name also isn't that reliable since it's formed from the 
client's Configuration
I had suggested this to [~chrili]. I realize that the configuration value passed 
in by MapReduce is actually a task ID, so a client name based on it will not be 
useful unless we parse it to extract the job ID.
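To make the parsing concrete: a classic MapReduce task attempt ID of the form attempt_&lt;cluster-timestamp&gt;_&lt;job-seq&gt;_&lt;type&gt;_&lt;task-seq&gt;_&lt;attempt&gt; embeds the job ID, so it can be recovered with string manipulation along these lines (a rough sketch only; the exact DFSClient name layout assumed below is an illustration, not the canonical format):

```java
// Hypothetical sketch: recover a job ID from a DFSClient name that embeds a
// MapReduce task attempt ID, e.g. "DFSClient_attempt_201312101000_0001_m_000001_0".
public class ClientNameParser {
    static String jobIdFromClientName(String clientName) {
        int i = clientName.indexOf("attempt_");
        if (i < 0) {
            return null; // not a client created by a MapReduce task
        }
        // attempt_<cluster-timestamp>_<job-seq>_<type>_<task-seq>_<attempt>
        String[] parts = clientName.substring(i).split("_");
        if (parts.length < 3) {
            return null; // malformed attempt ID
        }
        // The job ID is the cluster timestamp plus the job sequence number.
        return "job_" + parts[1] + "_" + parts[2];
    }

    public static void main(String[] args) {
        // prints "job_201312101000_0001"
        System.out.println(
            jobIdFromClientName("DFSClient_attempt_201312101000_0001_m_000001_0"));
    }
}
```

This is exactly the kind of brittle string surgery the comment above calls an ugly hack, which is why an explicit context configuration is preferable.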

I agree that this is not how the final solution should work. I propose adding 
some kind of configuration that can be passed to establish the context in which 
access to services is happening. Currently this is done by the MapReduce 
framework: it sets the configuration "" which gets used in forming the 
DFSClient name.

We could do the following to satisfy the various user requirements:
# Add a new configuration in Common called "hadoop.application.context". Other 
services that want to do the same thing can either use this same configuration 
or find another way to configure it. This information should be marshalled from 
the client to the server, and congestion control can be built based on it.
# Let's also make the identities used for accounting configurable: they can be 
based on "context", "user", "token", or "default". That way people who do not 
like the default behavior can change it.
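For illustration, point 1 might look something like the client-side configuration below. The "hadoop.application.context" key is the one proposed above; the accounting-identity key name and its value set are purely hypothetical placeholders for point 2:

```xml
<!-- Hypothetical sketch of the proposed client configuration. -->
<property>
  <name>hadoop.application.context</name>
  <!-- Set by the framework (e.g. MapReduce) to identify the application;
       marshalled from client to server with each RPC. -->
  <value>job_201312101000_0001</value>
</property>
<property>
  <!-- Hypothetical key: which identity the server accounts RPCs against.
       One of: context, user, token, default. -->
  <name>ipc.server.accounting.identity</name>
  <value>context</value>
</property>
```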

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, 
> NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
> faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
> faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to respond. 
> We can improve quality of service for users during namenode peak loads by 
> replacing the FIFO call queue with a [Fair Call 
> Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
>  (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
