[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881556#comment-13881556
 ] 

Daryn Sharp commented on HADOOP-9640:
-------------------------------------

Agreed, this needs subtasks.  General comments/requests:
# Please make the default callq a {{BlockingQueue}} again, and have your custom 
implementations conform to the interface.
# The default callq should remain a {{LinkedBlockingQueue}}, not a 
{{FIFOCallQueue}}.  You're doing some pretty tricky locking and I'd rather 
trust the JDK.
# Getting the UGI via {{Call.getRemoteUser()}} would be much cleaner than an 
interface + enum to get the user and group.
# Using the literal string "unknown!" for a user or group is not a good idea.
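
A rough sketch of the last two points, under stated assumptions: {{Ugi}} and {{Call}} below are hypothetical stand-ins rather than the real Hadoop classes, and treating a null UGI as "no identity" is an assumption for illustration. The idea is only that the caller identity comes straight off the call, and that a missing identity is represented as an absent value instead of a sentinel string such as "unknown!".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CallerIdentitySketch {

    // Hypothetical stand-in for Hadoop's UserGroupInformation.
    static final class Ugi {
        final String user;
        final List<String> groups;

        Ugi(String user, List<String> groups) {
            this.user = user;
            this.groups = groups;
        }
    }

    // Hypothetical stand-in for the server-side Call.
    static final class Call {
        private final Ugi remoteUser; // assumed null when no identity is known

        Call(Ugi remoteUser) {
            this.remoteUser = remoteUser;
        }

        Ugi getRemoteUser() {
            return remoteUser;
        }
    }

    // Returns the user name if the call carries one; empty otherwise.
    // No sentinel string is ever handed to the caller.
    static Optional<String> userOf(Call call) {
        return Optional.ofNullable(call.getRemoteUser()).map(ugi -> ugi.user);
    }

    public static void main(String[] args) {
        Call known = new Call(new Ugi("alice", Arrays.asList("users")));
        Call anonymous = new Call(null);

        System.out.println(userOf(known).orElse("<no identity>"));
        System.out.println(userOf(anonymous).isPresent());
    }
}
```

A scheduler that needs a key for unidentified callers can then decide locally how to bucket them, rather than risk colliding with a real user or group that happens to be named "unknown!".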

The more I think about it, multiple queues will exacerbate the congestion 
problem, as Kihwal points out.  For that reason, I'd like to see minimal 
invasiveness in the Server class - that way I'll feel safe and you are free to 
experiment with alternate implementations.
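
One possible reading of "minimal invasiveness", as a sketch only: the server code depends on nothing but {{BlockingQueue}}, defaults to {{LinkedBlockingQueue}}, and loads any alternate implementation by class name. {{CallQueueFactory}}, its {{create}} method, and the convention of a public single-{{int}} capacity constructor are hypothetical and not taken from the patch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper, not part of Hadoop: chooses the call queue implementation.
public class CallQueueFactory {

    /**
     * Returns a LinkedBlockingQueue unless implClassName names another
     * BlockingQueue implementation with a public (int capacity) constructor.
     */
    @SuppressWarnings("unchecked")
    public static <E> BlockingQueue<E> create(String implClassName, int capacity) {
        if (implClassName == null || implClassName.isEmpty()) {
            return new LinkedBlockingQueue<>(capacity); // trusted JDK default
        }
        try {
            Class<?> clazz = Class.forName(implClassName);
            if (!BlockingQueue.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                    implClassName + " does not implement BlockingQueue");
            }
            return (BlockingQueue<E>) clazz.getConstructor(int.class).newInstance(capacity);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot create call queue " + implClassName, e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Default path: the handler threads only ever see a BlockingQueue.
        BlockingQueue<String> callq = create(null, 100);
        callq.put("getFileInfo");
        System.out.println(callq.getClass().getName() + " served " + callq.take());

        // Experimental path: any other BlockingQueue can be swapped in by name.
        // A JDK class is used here purely as a placeholder for a custom queue.
        BlockingQueue<String> alt = create("java.util.concurrent.ArrayBlockingQueue", 100);
        alt.put("listStatus");
        System.out.println(alt.getClass().getName() + " served " + alt.take());
    }
}
```

With this shape, an experimental queue lives entirely in its own class, and the server's put/take paths stay unchanged whichever implementation is selected.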

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, 
> NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
> faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
> faircallqueue5.patch, faircallqueue6.patch, 
> faircallqueue7_with_runtime_swapping.patch, 
> rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to respond. 
> We can improve quality of service for users during namenode peak loads by 
> replacing the FIFO call queue with a [Fair Call 
> Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
>  (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
