[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839710#comment-13839710 ]

Daryn Sharp commented on HADOOP-9640:
-------------------------------------

I haven't read all the docs, and I've only skimmed the patch, but the entire feature _must be configurable_. As in a toggle to directly use the {{LinkedBlockingQueue}} as today. An activity surge often isn't indicative of abuse, nor do I necessarily want heavy users to have priority above all others because there are multiple equal heavy users, nor do I want to debug priority inversions at this time. :)

I do think the patch might have potential performance benefits, as your graph mentions, from multiple queues lowering lock contention between the 100 hungry handlers. I've been working to lower lock contention, so while in the RPC layer I considered playing with the callQ but it wasn't even a blip in the profiler.

However, you can't extrapolate performance improvements from 2 client threads, 2 server threads, and multiple queues. I think you've effectively eliminated any lock contention and given each client their own queue. 2 threads will produce negligible contention with even 1 queue. Things don't get ugly till you have many threads contending. Measurements with at least 16-32 clients & server threads become interesting!

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf,
> NN-denial-of-service-updated-plan.pdf, faircallqueue.patch,
> rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was
> overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by
> replacing the FIFO call queue with a
> [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
> (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was
> creating huge number of small files in the user directory. Due to the heavy
> load on NN, the JT also was unable to communicate with NN...The cluster
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo
> requests. the job had a bug that called getFileInfo for a nonexistent file in
> an endless loop). All other requests to namenode were also affected by this
> and hence all jobs slowed down. Cluster almost came to a grinding
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)
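
The comment above asks for a toggle that falls back to the plain {{LinkedBlockingQueue}}. A minimal sketch of what such a switch could look like follows. It is illustrative only: the class {{CallQueueFactory}}, the key {{example.rpc.callqueue.impl}}, and the values {{linked}} and {{array}} are hypothetical names invented here, not part of Hadoop or of faircallqueue.patch, and {{ArrayBlockingQueue}} merely stands in for whatever alternative queue would be plugged in.

```java
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical factory for this sketch; not Hadoop code and not from the patch. */
final class CallQueueFactory {
    /** Hypothetical configuration key, invented for illustration. */
    static final String IMPL_KEY = "example.rpc.callqueue.impl";

    private CallQueueFactory() {}

    /**
     * Returns the queue selected by the configuration.
     * When the key is unset, the default is the LinkedBlockingQueue,
     * so existing behaviour is preserved unless someone opts in.
     */
    static <E> BlockingQueue<E> create(Properties conf, int capacity) {
        String impl = conf.getProperty(IMPL_KEY, "linked");
        switch (impl) {
            case "linked":
                return new LinkedBlockingQueue<>(capacity);
            case "array":
                // Stand-in for an alternative implementation (assumption).
                return new ArrayBlockingQueue<>(capacity);
            default:
                throw new IllegalArgumentException(
                        "Unknown call queue implementation: " + impl);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // No key set: falls back to the default queue.
        BlockingQueue<String> byDefault = create(conf, 100);
        System.out.println(byDefault.getClass().getSimpleName());

        // Explicit opt-in to the alternative.
        conf.setProperty(IMPL_KEY, "array");
        BlockingQueue<String> optedIn = create(conf, 100);
        System.out.println(optedIn.getClass().getSimpleName());
    }
}
```

Defaulting to the existing queue keeps the new behaviour strictly opt-in, which is the property the comment asks for. Rejecting unknown values outright, rather than silently falling back, is a design choice of this sketch.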