[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2015-02-06 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated HADOOP-9640:
-
Status: Open  (was: Patch Available)

Cancelling patch, as it no longer applies.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 2.2.0, 3.0.0
Reporter: Xiaobo Peng
Assignee: Chris Li
  Labels: hdfs, qos, rpc
 Attachments: FairCallQueue-PerformanceOnCluster.pdf, 
 MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, 
 faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, 
 faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 For an easy-to-read summary see: 
 http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/
 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-10-20 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-
Description: 
For an easy-to-read summary see: 
http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/

Several production Hadoop cluster incidents occurred where the Namenode was 
overloaded and failed to respond. 

We can improve quality of service for users during namenode peak loads by 
replacing the FIFO call queue with a [Fair Call 
Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
 (this plan supersedes rpc-congestion-control-draft-plan).

Excerpted from the communication of one incident, “The map task of a user was 
creating huge number of small files in the user directory. Due to the heavy 
load on NN, the JT also was unable to communicate with NN...The cluster became 
responsive only once the job was killed.”

Excerpted from the communication of another incident, “Namenode was overloaded 
by GetBlockLocation requests (Correction: should be getFileInfo requests. the 
job had a bug that called getFileInfo for a nonexistent file in an endless 
loop). All other requests to namenode were also affected by this and hence all 
jobs slowed down. Cluster almost came to a grinding halt…Eventually killed 
jobtracker to kill all jobs that are running.”

Excerpted from HDFS-945, “We've seen defective applications cause havoc on the 
NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k 
files) etc.”


  was:
Several production Hadoop cluster incidents occurred where the Namenode was 
overloaded and failed to respond. 

We can improve quality of service for users during namenode peak loads by 
replacing the FIFO call queue with a [Fair Call 
Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
 (this plan supersedes rpc-congestion-control-draft-plan).

Excerpted from the communication of one incident, “The map task of a user was 
creating huge number of small files in the user directory. Due to the heavy 
load on NN, the JT also was unable to communicate with NN...The cluster became 
responsive only once the job was killed.”

Excerpted from the communication of another incident, “Namenode was overloaded 
by GetBlockLocation requests (Correction: should be getFileInfo requests. the 
job had a bug that called getFileInfo for a nonexistent file in an endless 
loop). All other requests to namenode were also affected by this and hence all 
jobs slowed down. Cluster almost came to a grinding halt…Eventually killed 
jobtracker to kill all jobs that are running.”

Excerpted from HDFS-945, “We've seen defective applications cause havoc on the 
NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k 
files) etc.”



 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
Assignee: Chris Li
  Labels: hdfs, qos, rpc
 Attachments: FairCallQueue-PerformanceOnCluster.pdf, 
 MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, 
 faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, 
 faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 For an easy-to-read summary see: 
 http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/
 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”

[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-04-23 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: FairCallQueue-PerformanceOnCluster.pdf

[~mingma] Thanks for the feedback; these sound like good next steps for the 
FCQ / scheduler.

I have uploaded results of a benchmark on a real-world (87-node) cluster. It 
shows QoS successfully preventing denial-of-service situations, but it also 
identifies limitations in how the current HistoryRpcScheduler scales to longer 
history lengths.

I have a couple of ideas for how to fix this, and I will upload the current 
version soon to get feedback.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
Assignee: Chris Li
  Labels: hdfs, qos, rpc
 Attachments: FairCallQueue-PerformanceOnCluster.pdf, 
 MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, 
 faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, 
 faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-02-18 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: HADOOP-10278-atomicref-adapter.patch

The latest version of the atomicref adapter swaps queues by using handlers to 
clear the calls, choreographed with two refs, one for put and one for take.

We use a software version of double-checking to ensure the queue is likely 
empty, which should make dropping calls highly unlikely. Losing a call is not 
fatal in any case, since the client handles IPC timeouts with retries.

Tests are updated too, since the queue can only be swapped when there are 
active readers.

Here's what a live swap looks like: !http://i.imgur.com/g28zJ7u.png!
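
To make the swap mechanics concrete, here is a minimal, hypothetical sketch of 
the runtime-swap idea (class and method names are invented, and the real patch 
adds handler-driven checks that this sketch omits):

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Illustrative sketch only, not the attached patch: a call-queue holder that
 * can be swapped at runtime. Producers and consumers re-read the current
 * reference on every operation; swap() drains whatever remains in the old
 * queue into the new one so calls are unlikely to be lost.
 */
public class SwappableCallQueue<E> {
  // Two references, one consulted by put() and one by take(), mirroring the
  // two-ref choreography described in the comment above.
  private final AtomicReference<BlockingQueue<E>> putRef;
  private final AtomicReference<BlockingQueue<E>> takeRef;

  public SwappableCallQueue(BlockingQueue<E> initial) {
    this.putRef = new AtomicReference<>(initial);
    this.takeRef = new AtomicReference<>(initial);
  }

  public void put(E e) throws InterruptedException {
    putRef.get().put(e);
  }

  public E take() throws InterruptedException {
    return takeRef.get().take();
  }

  /** Point producers at the new queue first, then consumers, then drain. */
  public synchronized void swap(BlockingQueue<E> newQueue) {
    BlockingQueue<E> old = putRef.getAndSet(newQueue); // new puts go to newQueue
    takeRef.set(newQueue);                             // new takes follow
    old.drainTo(newQueue);                             // move any stragglers
    // A small window remains in which a racing put could still land on the old
    // queue after the drain; as noted above, client IPC retries make that loss
    // tolerable, and the real patch adds extra checks to shrink the window.
  }
}
{code}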

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-02-18 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: (was: HADOOP-10278-atomicref-adapter.patch)

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-23 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue7_with_runtime_swapping.patch

Attached preview of patch that enables swapping the namenode call queue at 
runtime.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 faircallqueue5.patch, faircallqueue6.patch, 
 faircallqueue7_with_runtime_swapping.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-16 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue6.patch

Uploaded a new patch that adds a configurable call identity used for scheduling.

Config:
ipc.8020.call.identity = USER or GROUP

In the future, this can be extended with more options.
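
For concreteness, a hedged sketch of setting this key through Hadoop's 
Configuration API (the wrapper class is invented; only the key name and values 
come from the comment above):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical example: tell the RPC server on port 8020 to group calls by
// user when computing scheduling priority.
public class CallIdentityConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("ipc.8020.call.identity", "USER"); // or "GROUP"
    System.out.println(conf.get("ipc.8020.call.identity"));
  }
}
{code}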

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 faircallqueue5.patch, faircallqueue6.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-09 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue4.patch

Added a new version of the patch.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-09 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue5.patch

Updated patch to target trunk

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, 
 faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-06 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue2.patch

[~daryn] Definitely: this new patch is pluggable and defaults to the 
LinkedBlockingQueue via FIFOCallQueue. We will also be testing performance on 
larger clusters in January.

Please let me know your thoughts on this new patch.

In this new patch (faircallqueue2.patch):
*Architecture*
The FairCallQueue is responsible for its Scheduler and Mux, which will also be 
made pluggable in the future. They are not pluggable yet, since there is only 
one option for each today.

Changes to NameNodeRPCServer (and others) are no longer necessary.

*Scheduling Token*
We use the username for now, but will switch to the job ID once a good way of 
including it is decided.

*Cross-server scheduling*
Scheduling across servers (for instance, the Namenode can have two RPC 
servers, one for user calls and one for service calls) will be supported in a 
future patch.

*Configuration*
Configuration keys are keyed by port, so for a server running on port 8020:

_ipc.8020.callqueue.impl_
Defaults to FIFOCallQueue.class, which uses a LinkedBlockingQueue. To enable 
priority, use org.apache.hadoop.ipc.FairCallQueue.

_ipc.8020.faircallqueue.priority-levels_
Defaults to 4; controls the number of priority levels in the FairCallQueue.

_ipc.8020.history-scheduler.service-users_
A comma-separated list of users that are exempt from scheduling and always 
given top priority. Used to give service users (such as hadoop or hdfs) 
absolute priority. Example: hadoop,hdfs

_ipc.8020.history-scheduler.history-length_
The number of past calls to remember. The HistoryRpcScheduler schedules 
requests based on this pool. Defaults to 1000.

_ipc.8020.history-scheduler.thresholds_
A comma-separated list of ints that specify the thresholds for scheduling in 
the history scheduler. For instance, with 4 queues and a history-length of 
1000, a value of 50,400,750 schedules requests with a history count greater 
than 750 into queue 3, greater than 400 into queue 2, greater than 50 into 
queue 1, and everything else into queue 0. Defaults to an even split (for a 
history-length of 200 and 4 queues, the thresholds are spaced 50 apart: 
50,100,150).

_ipc.8020.wrr-multiplexer.weights_
A comma-separated list of ints that specify the weight of each queue. For 
instance, with 4 queues, 10,5,5,1 sets the handlers to draw from the queues in 
the following pattern:
* Read queue0 10 times
* Read queue1 5 times
* Read queue2 5 times
* Read queue3 1 time
And then repeat. Defaults to a log2 split: for 4 queues, it would be 8,4,2,1.
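
Tying the keys above together, a hedged example of setting them 
programmatically with Hadoop's Configuration API (the wrapper class is 
invented; the values mirror the defaults and examples listed above):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class FairCallQueueConfigExample {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Swap the default FIFO queue for the FairCallQueue implementation.
    conf.set("ipc.8020.callqueue.impl", "org.apache.hadoop.ipc.FairCallQueue");
    // Four priority levels (the stated default).
    conf.setInt("ipc.8020.faircallqueue.priority-levels", 4);
    // Service users that bypass scheduling and always get top priority.
    conf.set("ipc.8020.history-scheduler.service-users", "hadoop,hdfs");
    // Remember the last 1000 calls when computing priority.
    conf.setInt("ipc.8020.history-scheduler.history-length", 1000);
    // More than 750 recent calls -> queue 3, more than 400 -> queue 2,
    // more than 50 -> queue 1, otherwise queue 0.
    conf.set("ipc.8020.history-scheduler.thresholds", "50,400,750");
    // Handlers read queue0 ten times, queue1 and queue2 five times each,
    // and queue3 once, then repeat.
    conf.set("ipc.8020.wrr-multiplexer.weights", "10,5,5,1");
    return conf;
  }
}
{code}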

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-06 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: faircallqueue3.patch

Updated patch to target the latest trunk.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 faircallqueue2.patch, faircallqueue3.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-04 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Attachment: MinorityMajorityPerformance.pdf

Uploaded a preview of performance with two users, one normal and the other 
abusive. Should have code up soon.

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: MinorityMajorityPerformance.pdf, 
 NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, 
 rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-03 Thread Chris Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Li updated HADOOP-9640:
-

Summary: RPC Congestion Control with FairCallQueue  (was: RPC Congestion 
Control)

 RPC Congestion Control with FairCallQueue
 -

 Key: HADOOP-9640
 URL: https://issues.apache.org/jira/browse/HADOOP-9640
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.2.0
Reporter: Xiaobo Peng
  Labels: hdfs, qos, rpc
 Attachments: NN-denial-of-service-updated-plan.pdf, 
 faircallqueue.patch, rpc-congestion-control-draft-plan.pdf


 Several production Hadoop cluster incidents occurred where the Namenode was 
 overloaded and failed to respond. 
 We can improve quality of service for users during namenode peak loads by 
 replacing the FIFO call queue with a [Fair Call 
 Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
  (this plan supersedes rpc-congestion-control-draft-plan).
 Excerpted from the communication of one incident, “The map task of a user was 
 creating huge number of small files in the user directory. Due to the heavy 
 load on NN, the JT also was unable to communicate with NN...The cluster 
 became responsive only once the job was killed.”
 Excerpted from the communication of another incident, “Namenode was 
 overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
 requests. the job had a bug that called getFileInfo for a nonexistent file in 
 an endless loop). All other requests to namenode were also affected by this 
 and hence all jobs slowed down. Cluster almost came to a grinding 
 halt…Eventually killed jobtracker to kill all jobs that are running.”
 Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
 the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
 (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)