[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839710#comment-13839710
 ] 

Daryn Sharp commented on HADOOP-9640:
-------------------------------------

I haven't read all the docs, and I've only skimmed the patch, but the entire 
feature _must be configurable_, i.e. a toggle to directly use the 
{{LinkedBlockingQueue}} as today.  An activity surge often isn't indicative of 
abuse, nor do I necessarily want heavy users to have priority above all others 
when there are multiple equally heavy users, nor do I want to debug priority 
inversions at this time. :)
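
A minimal sketch of what such a toggle could look like, assuming a factory that picks the queue implementation from a config value (the key name, the "fair" value, and the {{PriorityBlockingQueue}} stand-in for the proposed FairCallQueue are all illustrative assumptions, not the actual patch):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

public class CallQueueFactory {
    // Hypothetical config key for illustration, e.g. "ipc.callqueue.impl".
    static <E> BlockingQueue<E> create(String impl, int capacity) {
        if ("fair".equals(impl)) {
            // Stand-in for the proposed FairCallQueue; a real toggle would
            // instantiate the fair implementation here.
            return new PriorityBlockingQueue<E>(capacity);
        }
        // Default: exactly today's behavior.
        return new LinkedBlockingQueue<E>(capacity);
    }

    public static void main(String[] args) {
        BlockingQueue<String> q = create("linked", 100);
        System.out.println(q.getClass().getSimpleName());
    }
}
```

Defaulting to {{LinkedBlockingQueue}} when the key is unset keeps existing deployments untouched.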

I do think the patch might have potential performance benefits, as your graph 
suggests, from multiple queues lowering lock contention between the 100 hungry 
handlers.  I've been working to lower lock contention, so while I was in the 
RPC layer I considered playing with the callQ, but it wasn't even a blip in 
the profiler.

However, you can't extrapolate performance improvements from 2 client threads, 
2 server threads, and multiple queues.  I think you've effectively eliminated 
any lock contention by giving each client its own queue.  2 threads will 
produce negligible contention with even 1 queue.  Things don't get ugly till 
you have many threads contending.  Measurements with at least 16-32 client & 
server threads would be interesting!
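
A rough sketch of the kind of measurement meant here (not a rigorous benchmark like JMH; thread counts and op counts are illustrative) would time N producer and N consumer threads hammering a single {{LinkedBlockingQueue}}, so the same run can be repeated at 2 vs 16-32 threads:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueContentionBench {
    // Times `pairs` producer threads and `pairs` consumer threads sharing
    // one queue; returns elapsed nanoseconds.
    static long run(int pairs, int opsPerThread) throws Exception {
        BlockingQueue<Integer> q = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(pairs * 2);
        CountDownLatch done = new CountDownLatch(pairs * 2);
        long start = System.nanoTime();
        for (int i = 0; i < pairs; i++) {
            pool.execute(() -> {
                for (int j = 0; j < opsPerThread; j++) q.offer(j);
                done.countDown();
            });
            pool.execute(() -> {
                try {
                    for (int j = 0; j < opsPerThread; j++) q.take();
                } catch (InterruptedException ignored) {
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        // Compare low- vs high-contention configurations.
        System.out.printf("2 threads:  %d ns%n", run(1, 100_000));
        System.out.printf("32 threads: %d ns%n", run(16, 100_000));
    }
}
```

The interesting comparison is per-op throughput at the higher thread counts, with and without the multi-queue patch.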

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, 
> NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
> rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to respond. 
> We can improve quality of service for users during namenode peak loads by 
> replacing the FIFO call queue with a [Fair Call 
> Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
> (This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)
