[jira] [Commented] (HELIX-655) Helix per-participant concurrent task throttling

ASF GitHub Bot (JIRA) Mon, 15 May 2017 10:38:27 -0700

    [ https://issues.apache.org/jira/browse/HELIX-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010960#comment-16010960 ]


ASF GitHub Bot commented on HELIX-655:
--------------------------------------

Github user jiajunwang commented on a diff in the pull request:

    https://github.com/apache/helix/pull/89#discussion_r116554013
  
    --- Diff: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java ---
    @@ -90,60 +96,86 @@ private BestPossibleStateOutput compute(ClusterEvent event, Map<String, Resource
     
         BestPossibleStateOutput output = new BestPossibleStateOutput();
     
    -    for (String resourceName : resourceMap.keySet()) {
    -      logger.debug("Processing resource:" + resourceName);
    +    // Reset current INIT/RUNNING tasks on participants for throttling
    +    JobRebalancer.resetActiveTaskCount(cache.getLiveInstances().keySet(), currentStateOutput);
     
    -      Resource resource = resourceMap.get(resourceName);
    -      // Ideal state may be gone. In that case we need to get the state model name
    -      // from the current state
    -      IdealState idealState = cache.getIdealState(resourceName);
    +    PriorityQueue<JobResourcePriority> jobResourceQueue = new PriorityQueue<JobResourcePriority>();
    --- End diff --
    
    SortedSet is an interface, so do you mean using a TreeSet?
    Why do you think that would be better? I don't think there is a big
    difference between a TreeSet and a PriorityQueue here.
    
    Besides, as you can see, we do want to implement some kind of job priority
    here, so a PriorityQueue is easier for others to understand.
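
For readers weighing the two collections discussed above, the sketch below shows one behavioral difference in `java.util` that may matter when several jobs share a priority: a `TreeSet` treats elements that compare as equal as duplicates and keeps only the first, while a `PriorityQueue` keeps all of them (in unspecified order among ties). `JobPriority` and its fields are hypothetical stand-ins, not the `JobResourcePriority` class from the pull request, and the "smaller value runs first" ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

final class QueueVsSetSketch {
  // Hypothetical stand-in for the PR's JobResourcePriority; fields are assumptions.
  static final class JobPriority {
    final String job;
    final int priority; // assumption: smaller value is scheduled earlier

    JobPriority(String job, int priority) {
      this.job = job;
      this.priority = priority;
    }
  }

  // Orders by priority only, so two jobs with the same priority compare as equal.
  static final Comparator<JobPriority> BY_PRIORITY =
      Comparator.comparingInt((JobPriority j) -> j.priority);

  // PriorityQueue keeps every element, including ties, and polls in priority order.
  static List<String> drainQueue(List<JobPriority> jobs) {
    PriorityQueue<JobPriority> queue = new PriorityQueue<>(BY_PRIORITY);
    queue.addAll(jobs);
    List<String> order = new ArrayList<>();
    while (!queue.isEmpty()) {
      order.add(queue.poll().job);
    }
    return order;
  }

  // TreeSet silently ignores an element that compares equal to one already present.
  static List<String> iterateSet(List<JobPriority> jobs) {
    TreeSet<JobPriority> set = new TreeSet<>(BY_PRIORITY);
    set.addAll(jobs);
    List<String> order = new ArrayList<>();
    for (JobPriority j : set) {
      order.add(j.job);
    }
    return order;
  }

  public static void main(String[] args) {
    List<JobPriority> jobs = Arrays.asList(
        new JobPriority("A", 1), new JobPriority("B", 1), new JobPriority("C", 0));
    System.out.println("queue: " + drainQueue(jobs));
    System.out.println("set:   " + iterateSet(jobs));
  }
}
```

With a comparator that breaks ties (for example, by job name) the two collections would hold the same elements; note also that iterating a `PriorityQueue` directly, rather than polling it, does not visit elements in sorted order.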


> Helix per-participant concurrent task throttling
> ------------------------------------------------
>
>                 Key: HELIX-655
>                 URL: https://issues.apache.org/jira/browse/HELIX-655
>             Project: Apache Helix
>          Issue Type: New Feature
>          Components: helix-core
>    Affects Versions: 0.6.x
>            Reporter: Jiajun Wang
>            Assignee: Junkai Xue
>
> h1. Overview
> Currently, all runnable jobs/tasks in Helix are treated equally. They are all 
> scheduled according to the rebalancer algorithm. That is, their assignments 
> might differ, but they will all be in the RUNNING state.
> This may cause an issue if there are too many concurrently runnable jobs. 
> When the Helix controller starts all these jobs, the instances may be 
> overloaded as they allocate resources and execute all the tasks. As a result, 
> the jobs won't be able to finish in a reasonable time window.
> The issue is even more critical for long-running jobs. According to our 
> meeting with the Gobblin team, when a job is scheduled, they allocate 
> resources for the job. So in the situation described above, more and more 
> resources will be reserved for the pending jobs. The cluster will soon be 
> exhausted.
> To work around the problem, an application needs to schedule jobs at a 
> relatively low frequency (what Gobblin is doing now). This may cause low 
> utilization.
> A better way to fix this issue, at the framework level, is to throttle 
> jobs/tasks that are running concurrently, and to allow setting priorities for 
> different jobs to control total execution time.
> Then, given the same number of jobs, the cluster is in a better condition. As 
> a result, jobs running in that cluster have a more controllable execution time.
> Existing related control mechanisms are:
> * ConcurrentTasksPerInstance for each job
> * ParallelJobs for each workflow
> * Thread pool limit on the participant, if the user customizes 
> TaskStateModelFactory.
> But none of them can directly help when the number of concurrent workflows or 
> jobs is large. If an application keeps scheduling jobs/jobQueues, Helix will 
> start any runnable jobs without considering the workload on the participants.
> The application may be able to carefully configure these items to achieve 
> the goal. But it won't be easy to find the sweet spot, especially 
> since the cluster might be changing (scaling out, etc.).
> h2. Problem summary
> # All runnable tasks will start executing, which may overload the participant.
> # Applications need a mechanism to prioritize important jobs (or workflows). 
> Otherwise, important tasks may be blocked by less important ones, and 
> allocated resources are wasted.
> h2. Feature proposed
> Based on our discussion, we propose 2 features that can help to resolve the 
> issue.
> # Running-task throttling on each participant. This is to avoid overload.
> # Job priority control that ensures high-priority jobs are scheduled earlier.
> In addition, applications can leverage workflow/job monitor items as feedback 
> from Helix to adjust their strategy.
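
The two proposed features can be sketched together as follows. This is a minimal illustration under stated assumptions, not the implementation in the pull request: `Job`, `schedule`, and `maxTasksPerInstance` are hypothetical names, a smaller priority value is assumed to mean "schedule earlier", and each job is assumed to list the participant each of its tasks targets. Jobs are drained from a priority queue, and a task is started only if its target participant is below the per-participant cap; otherwise it stays pending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class ThrottleSketch {
  // Hypothetical job description; not a Helix class.
  static final class Job implements Comparable<Job> {
    final String name;
    final int priority;                 // assumption: smaller value runs first
    final List<String> targetInstances; // one entry per task: the participant it targets

    Job(String name, int priority, List<String> targetInstances) {
      this.name = name;
      this.priority = priority;
      this.targetInstances = targetInstances;
    }

    @Override
    public int compareTo(Job other) {
      int byPriority = Integer.compare(priority, other.priority);
      // Break ties by name so the processing order is deterministic.
      return byPriority != 0 ? byPriority : name.compareTo(other.name);
    }
  }

  /**
   * Returns, per job (in the order jobs were processed), the participants on
   * which a task was started. Tasks that would exceed the cap are left pending.
   */
  static Map<String, List<String>> schedule(
      List<Job> jobs, Map<String, Integer> activeTasks, int maxTasksPerInstance) {
    PriorityQueue<Job> queue = new PriorityQueue<>(jobs);
    // Copy the current counts so the caller's map is not modified.
    Map<String, Integer> active = new HashMap<>(activeTasks);
    Map<String, List<String>> started = new LinkedHashMap<>();
    while (!queue.isEmpty()) {
      Job job = queue.poll();
      List<String> placed = new ArrayList<>();
      for (String instance : job.targetInstances) {
        int running = active.getOrDefault(instance, 0);
        if (running < maxTasksPerInstance) {
          active.put(instance, running + 1);
          placed.add(instance);
        }
      }
      started.put(job.name, placed);
    }
    return started;
  }

  public static void main(String[] args) {
    List<Job> jobs = Arrays.asList(
        new Job("low", 5, Arrays.asList("p1", "p1", "p2")),
        new Job("high", 1, Arrays.asList("p1", "p2")));
    Map<String, Integer> active = new HashMap<>();
    active.put("p1", 1); // p1 already runs one task
    System.out.println(schedule(jobs, active, 2));
  }
}
```

Because the high-priority job is processed first, it claims the remaining capacity on a busy participant, and lower-priority tasks wait until counts drop in a later scheduling pass.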



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
