[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763778#comment-16763778 ] Erik Krogen commented on HADOOP-9640: - Moved HADOOP-10286, HADOOP-13029, HADOOP-10599, HADOOP-10598 to issues instead of subtasks. Now with all subtasks complete, I will close out this umbrella. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0-alpha1 >Reporter: Xiaobo Peng >Assignee: Chris Li >Priority: Major > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763774#comment-16763774 ] Wei-Chiu Chuang commented on HADOOP-9640: - Works for me. Thanks for the help here! We didn't enable FairCallQueue and the related features in CDH5 and we just started looking into this feature. I am reviewing HADOOP-10286 (as well as HDFS-12345) > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0-alpha1 >Reporter: Xiaobo Peng >Assignee: Chris Li >Priority: Major > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763772#comment-16763772 ] Erik Krogen commented on HADOOP-9640: - Hi [~linyiqun], I just uploaded some documentation at HADOOP-16097. Please take a look! [~jojochuang], regarding your comment about cleaning up this JIRA, I am thinking to move all unresolved subtasks out (I would consider all of the as-yet unresolved tasks as follow-ons) and close this umbrella. Let me know your thoughts. Also if you have some time to help review HADOOP-10286 it would be appreciated :) > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0-alpha1 >Reporter: Xiaobo Peng >Assignee: Chris Li >Priority: Major > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526157#comment-16526157 ] Yiqun Lin commented on HADOOP-9640: --- Hi all, Any progress on this issue? This looks like a good feature and can be used in production env, any documentation where we can learn from that how to configure using this? Appropriate for this, :). > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0-alpha1 >Reporter: Xiaobo Peng >Assignee: Chris Li >Priority: Major > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733724#comment-14733724 ] Ajith S commented on HADOOP-9640: - I missed checking the sub-tasks. Will like to update documentation for FairCallQueue as it has lot of configuration introduced. It will be better if we can explain it briefly similar to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html. Added a sub-task for it. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731185#comment-14731185 ] Chris Li commented on HADOOP-9640: -- [~ajithshetty] Sure, what did you have in mind? > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730272#comment-14730272 ] Ajith S commented on HADOOP-9640: - Hi [~chrilisf] Any progress on the issue.? If you are not looking into this, i would like to continue work :) > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.2.0, 3.0.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177698#comment-14177698 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12641612/FairCallQueue-PerformanceOnCluster.pdf against trunk revision e90718f. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4932//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > For an easy-to-read summary see: > http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/ > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002801#comment-14002801 ] Ming Ma commented on HADOOP-9640: - Sorry, there was a typo about, I meant "thousands of RPC requests" in RPC queue. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990308#comment-13990308 ] Ming Ma commented on HADOOP-9640: - Thanks, Chris. 1. The current approach drops call when RPC queue is full and the client relies on RPC timeout. It will be interesting to confirm if it is useful to have RPC server throw some exception back to client and have client do exponential back off; or maybe just block the RPC reader thread instead. 2. RPC-based approach didn't account for http request such as webHDFS. Based on some test results, it seems Jetty uses around 250 threads, small compared to the thousands of RPC handler threads. a) The bad application traffic from webHDFS still has impact on RPC latency, not as severe compared to the RPC case. b), if there are SLA jobs based on webHDFS, then the RPC throttling won't help much. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988113#comment-13988113 ] Chris Li commented on HADOOP-9640: -- Uploaded patches to HADOOP-10279, HADOOP-10281, and HADOOP-10282 for feedback. The new scheduler fixes the performance issues identified in the earlier PDF too. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979194#comment-13979194 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12641612/FairCallQueue-PerformanceOnCluster.pdf against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3841//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: FairCallQueue-PerformanceOnCluster.pdf, > MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, > faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977949#comment-13977949 ] Ming Ma commented on HADOOP-9640: - Nice work. Some high-level comments, 1. At some point, we might need to prioritize DN RPC over client RPC so that no matter what application do to NN RPC and FSNamesystem's global lock, DN's requests will be processed timely. We can do it in two ways. a) config a global RPC server and have the pluggable CallQueue handle that. b) have one RPC server for client and one RPC server of service request, for that we will need some abstraction like https://issues.apache.org/jira/browse/HDFS-5639. 2. CallQueue priority policy. Perhaps this could leave it to the plugin implementation. It can be somewhat soft policy like FaiCallQueue, or with some sort of allocation quota like other schedulers, .e.g., if we know the cluster has allocated 50% to some group at YARN layer, perhaps it is ok to assume that NN RPC request for that group can be around 50%. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915068#comment-13915068 ] Chris Li commented on HADOOP-9640: -- Hey all, Can anyone check out https://issues.apache.org/jira/browse/HADOOP-10280 and give feedback on the next stage? > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng >Assignee: Chris Li > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904731#comment-13904731 ] Chris Li commented on HADOOP-9640: -- Ignore above comment, it was meant for HADOOP-10278 > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881747#comment-13881747 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12624909/faircallqueue7_with_runtime_swapping.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1547 javac compiler warnings (more than the trunk's current 1546 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.ipc.TestQueueRuntimeReconfigure org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881614#comment-13881614 ] Chris Li commented on HADOOP-9640: -- I've uploaded the first of the patches to https://issues.apache.org/jira/browse/HADOOP-10278 It allows the user to use a custom call queue specified via configuration, but falls back on a LinkedBlockingQueue otherwise. I'd like to take any further discussions about this aspect to the subtask, and get some feedback. Thanks > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881571#comment-13881571 ] Chris Li commented on HADOOP-9640: -- [~daryn] Thanks for your feedback. Some points of clarification: 3. The identity is meant to be configurable, so you can schedule by user, by group, and in the future by job. 4. Any suggestions? > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881556#comment-13881556 ] Daryn Sharp commented on HADOOP-9640: - Agreed, this needs subtasks. General comments/requests: # Please make the default callq a {{BlockingQueue}} again, and have your custom implementations conform to the interface. # The default callq should remain a {{LinkedBlockingQueue}}, not a {{FIFOCallQueue}}. You're doing some pretty tricky locking and I'd rather trust the JDK. # Call.getRemoteUser() would be much cleaner to get the UGI than an interface + enum to get user and group. # Using the literal string "unknown!" for a user or group is not a good idea. The more I think about it, multiple queues will exasperate congestion problem as Kihwal points out. For that reason, I'd like to see minimal invasiveness in the Server class - I'll feel safe and you are free to experiment with alternate implementations. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881465#comment-13881465 ] Chris Li commented on HADOOP-9640: -- [~kihwal] In the first 6 versions of this patch, this does indeed happen. It's partially alleviated due to the round-robin withdrawal from the queues. In the latest iteration of the patch (7), the reader threads would lock on the queue's putLock like they do in trunk. I think this behavior is more intuitive. Today I will be breaking this JIRA to make it easier to review. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881428#comment-13881428 ] Kihwal Lee commented on HADOOP-9640: Can low priority requests starve higher priority requests? If a low priority call queue is full and all reader threads are blocked on put() for adding calls belonging to that queue, newly arriving higher priority requests won't get processed even if their corresponding queue is not full. If the request rate stays greater than service rate for some time in this state, the listen queue will likely overflow and all types of requests will suffer regardless of priority. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > faircallqueue7_with_runtime_swapping.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880231#comment-13880231 ] Mayank Bansal commented on HADOOP-9640: --- Hi [~sureshms] Can you please take a look at this jira? Thanks, Mayank > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868553#comment-13868553 ] Chris Li commented on HADOOP-9640: -- I'd like to get people's thoughts on a new feature: Dynamic reconfiguration h5. Motivation 1. The tuning of parameters will be important for optimal performance 2. We can recover faster from bad parameters 3. The cost of doing a NN failover to change parameters is too high, while this would be much faster (seconds) h5. User Interface Much like `hadoop mradmin -refreshQueueAcls` the user would run the command to reload the CallQueue based on config. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849674#comment-13849674 ] Hadoop QA commented on HADOOP-9640: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618954/faircallqueue6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3361//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3361//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, faircallqueue6.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844743#comment-13844743 ] Chris Li commented on HADOOP-9640: -- bq. Add a new configuration in common called "hadoop.application.context" to HDFS. Other services that want to do the same thing can either use this same configuration and find another way to configure it. This information should be marshalled from the client to the server. The congestion control can be built based on that. Just to be clear, would an example be, 1. Cluster operator specifies ipc.8020.application.context = hadoop.yarn 2. Namenode sees this, knows to load the class that generates job IDs from the Connection/Call? Or were you thinking of physically adding the id into the RPC call itself, which would make the rpc call size larger, but is a cleaner solution (albeit one that the client could spoof). bq. Lets also make identities used for accounting configurable. They can be either based on "context", "user", "token", or "default". That way people who do not like the default configuration can make changes. Sounds like a good idea. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844506#comment-13844506 ] Suresh Srinivas commented on HADOOP-9640: - I had in person meeting with [~chrili] on this. This is excellent work! bq. Parsing the MapReduce job name out of the DFSClient name is kind of an ugly hack. The client name also isn't that reliable since it's formed from the client's Configuration I had suggested this to [~chrili]. I realize that the configuration passed from MapReduce is actually a task ID. So the client name based on that will not be useful, unless we parse it to get the job ID. I agree that this is not the way the final solution should work. I propose adding some kind of configuration that can be passed to establish context in which access to services is happening. Currently this is done by mapreduce framework. It sets the configuration "" which gets used in forming DFSClient name. We could do the following to satisfy the various user requirements: # Add a new configuration in common called "hadoop.application.context" to HDFS. Other services that want to do the same thing can either use this same configuration and find another way to configure it. This information should be marshalled from the client to the server. The congestion control can be built based on that. # Lets also make identities used for accounting configurable. They can be either based on "context", "user", "token", or "default". That way people who do not like the default configuration can make changes. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843586#comment-13843586 ] Hadoop QA commented on HADOOP-9640: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617900/faircallqueue5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3348//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3348//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843551#comment-13843551 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617895/faircallqueue4.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3347//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841762#comment-13841762 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617475/faircallqueue3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1546 javac compiler warnings (more than the trunk's current 1545 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, faircallqueue3.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841670#comment-13841670 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617457/faircallqueue2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3343//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > faircallqueue2.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839710#comment-13839710 ] Daryn Sharp commented on HADOOP-9640: - I haven't read all the docs, and I've only skimmed the patch, but the entire feature _must be configurable_. As in a toggle to directly use the {{LinkedBlockingQueue}} as today. An activity surge often isn't indicative of abuse, nor do I necessarily want heavy users to have priority above all others because there are multiple equal heavy users, nor do I want to debug priority inversions at this time. :) I do think the patch might have potential performance benefits, as your graph mentions, from multiple queues lowering lock contention between the 100 hungry handlers. I've been working to lower lock contention, so while in the RPC layer I considered playing with the callQ but it wasn't even a blip in the profiler. However, you can't extrapolate performance improvements from 2 client threads, 2 server threads, and multiple queues. I think you've effectively eliminated any lock contention and given each client their own queue. 2 threads will produce negligible contention with even 1 queue. Things don't get ugly till you have many threads contending. Measurements with at least 16-32 clients & server threads become interesting! > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839654#comment-13839654 ] Hadoop QA commented on HADOOP-9640: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617098/MinorityMajorityPerformance.pdf against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3335//console This message is automatically generated. > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: MinorityMajorityPerformance.pdf, > NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, > rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838519#comment-13838519 ] Andrew Wang commented on HADOOP-9640: - Sorry, it looks like I read the wrong document. I'm glad you still found some of my comments useful, but I'll read the updated one too and get back to you :) > RPC Congestion Control with FairCallQueue > - > > Key: HADOOP-9640 > URL: https://issues.apache.org/jira/browse/HADOOP-9640 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0 >Reporter: Xiaobo Peng > Labels: hdfs, qos, rpc > Attachments: NN-denial-of-service-updated-plan.pdf, > faircallqueue.patch, rpc-congestion-control-draft-plan.pdf > > > Several production Hadoop cluster incidents occurred where the Namenode was > overloaded and failed to respond. > We can improve quality of service for users during namenode peak loads by > replacing the FIFO call queue with a [Fair Call > Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. > (this plan supersedes rpc-congestion-control-draft-plan). > Excerpted from the communication of one incident, “The map task of a user was > creating huge number of small files in the user directory. Due to the heavy > load on NN, the JT also was unable to communicate with NN...The cluster > became responsive only once the job was killed.” > Excerpted from the communication of another incident, “Namenode was > overloaded by GetBlockLocation requests (Correction: should be getFileInfo > requests. the job had a bug that called getFileInfo for a nonexistent file in > an endless loop). All other requests to namenode were also affected by this > and hence all jobs slowed down. Cluster almost came to a grinding > halt…Eventually killed jobtracker to kill all jobs that are running.” > Excerpted from HDFS-945, “We've seen defective applications cause havoc on > the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories > (60k files) etc.” -- This message was sent by Atlassian JIRA (v6.1#6144)