Hunter L created HELIX-730:
------------------------------
Summary: [TASK] Add ThreadCountBasedAssignmentCalculator and
integrate with Workflow/JobRebalancer and fix rebalancing logic
Key: HELIX-730
URL: https://issues.apache.org/jira/browse/HELIX-730
Project: Apache Helix
Issue Type: Improvement
Reporter: Hunter L
For quota-based scheduling of tasks, we have added the TaskAssigner interface,
which takes AssignableInstances into account by way of
AssignableInstanceManager. To use it in the existing pipeline (prior to Task
Framework 2.0), GenericTaskAssignmentCalculator was replaced with
ThreadCountBasedAssignmentCalculator, which is a wrapper around TaskAssigner.
The necessary adjustments were made in Workflow/JobRebalancer for this
replacement, and the rebalance logic in Workflow/JobRebalancer was reviewed
and fixed. Additionally, TestQuotaBasedScheduling was added to test
quota-based task scheduling. Note that quotas will apply to both generic and
targeted jobs.
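As a rough illustration of thread-count-based assignment under per-type
quotas, here is a minimal, self-contained sketch. It does not use the Helix
API: the class QuotaAssignSketch, the Instance type, and the assign() and
release() methods are hypothetical stand-ins, and the "pick the instance with
the most free threads" policy is an assumption, not necessarily what
ThreadCountBasedTaskAssigner does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QuotaAssignSketch {
    /** Hypothetical stand-in for an assignable instance: free threads per quota type. */
    static final class Instance {
        final String name;
        final Map<String, Integer> freeThreads = new HashMap<>();

        Instance(String name, Map<String, Integer> capacityByQuotaType) {
            this.name = name;
            this.freeThreads.putAll(capacityByQuotaType);
        }

        /** Reserve one thread of the given quota type; false if none is free. */
        boolean tryAssign(String quotaType) {
            int free = freeThreads.getOrDefault(quotaType, 0);
            if (free <= 0) {
                return false;
            }
            freeThreads.put(quotaType, free - 1);
            return true;
        }

        /** Give one thread back, e.g. when a task finishes or its job becomes inactive. */
        void release(String quotaType) {
            freeThreads.merge(quotaType, 1, Integer::sum);
        }
    }

    /**
     * Assign each task to the instance with the most free threads for the
     * quota type (ties go to the earlier instance). Tasks that do not fit
     * are left unassigned and can be retried in a later pass.
     */
    static Map<String, String> assign(List<String> tasks, String quotaType,
                                      List<Instance> instances) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (String task : tasks) {
            Instance best = null;
            for (Instance candidate : instances) {
                int free = candidate.freeThreads.getOrDefault(quotaType, 0);
                if (free > 0 && (best == null
                        || free > best.freeThreads.getOrDefault(quotaType, 0))) {
                    best = candidate;
                }
            }
            if (best != null && best.tryAssign(quotaType)) {
                assignment.put(task, best.name);
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Integer> capacity = Collections.singletonMap("A", 1);
        List<Instance> hosts = Arrays.asList(
                new Instance("host1", capacity), new Instance("host2", capacity));
        // Three tasks, two threads in total for quota type "A": t2 must wait.
        System.out.println(assign(Arrays.asList("t0", "t1", "t2"), "A", hosts));
    }
}
```

In this sketch a task that does not fit is simply left out of the result, so
a caller would have to offer it again once release() has returned a thread.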
A few bugs were uncovered during this process, such as faulty retry logic
that never actually restarted tasks. For more details, see the changelist
below.
Changelist:
1. Add ThreadCountBasedAssignmentCalculator, a wrapper around
ThreadCountBasedTaskAssigner
2. Make logic changes in JobRebalancer to enable the use of
ThreadCountBasedAssignmentCalculator
3. Fix the failing test by using a thread-safe map and rename
TestGenericTaskAssignmentCalculator to TestTaskAssignmentCalculator to better
reflect what its tests are doing
4. Add retry logic that was previously absent for INIT and DROPPED tasks in
JobRebalancer
5. Add TestQuotaBasedScheduling to test that jobs and tasks are assigned and
scheduled according to the quota config set in ClusterConfig
6. Add more log messages to aid with task-scheduling debugging in
AssignableInstance
7. In AbstractTaskDispatcher, implement retry logic for tasks that are
STOPPED, TIMED_OUT, or TASK_ERROR so that they are restarted correctly
8. In AbstractTaskDispatcher, when enforcing overlapAssign for jobs with
isAllowOverlapAssignment(), a fix was implemented so that only jobs whose state
is IN_PROGRESS are considered
9. In AbstractTaskDispatcher, the isWorkflowFinished() method was modified so
that non-active jobs have their tasks' resources freed from
AssignableInstances to prevent resource leaks
10. In markJobFailed() and markJobCompleted(), non-active jobs have their
tasks' resources freed from AssignableInstances to prevent resource leaks
11. Fix the logic so that quotas also apply to targeted jobs
12. Fix TestTaskRebalancer (assumes Consistent Hashing, which is no longer
used)
13. Fix TestIndependentTaskRebalancer (assumes Consistent Hashing, no longer
used)
14. Assignment logic was improved so that incomplete tasks whose assigned
participants are no longer live will be re-assigned accordingly
15. Fix TestTaskRebalanceFailover (tasks on non-live instances will be
re-assigned promptly)
16. Fix TestRebalanceRunningTask (targeted jobs will get tasks reassigned
upon liveInstance and currentState change)
17. Fix a bug in FixedAssignmentCalculator and assignment logic for targeted
jobs such that a task index will no longer be assigned multiple times
18. Fix TestJobFailureTaskNotStarted (tasks were not being assigned at all
due to having reached maximum capacity for quota)
19. Add a targetedTaskConfigMap field in JobConfig to cache TaskConfig objects
for targeted tasks, reducing object creation and GC overhead
20. Fix JobConfig so that it doesn't write quotaType to ZooKeeper when
quotaType is null or not set
21. Fix deleteWorkflow() in TaskUtil so that the first delete failure causes
the entire method to fail (and return early, so that an incomplete deletion
does not break other ZNodes)
22. Fix TestDeleteWorkflow by adding another removeProperty() call to lower
the test's failure rate
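The retry changes in items 4 and 7 can be pictured with the small sketch
below. It is only an illustration under stated assumptions: RetrySketch, its
TaskState enum, and shouldRestart() are hypothetical names rather than Helix
classes, and the rule that only TIMED_OUT and TASK_ERROR count against the
attempt limit is an assumption made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

public class RetrySketch {
    /** Hypothetical task states, named after the states mentioned in the changelist. */
    enum TaskState { INIT, RUNNING, STOPPED, COMPLETED, TIMED_OUT, TASK_ERROR, DROPPED }

    /** States from which a task is a candidate for being (re)started. */
    static final Set<TaskState> RETRYABLE = EnumSet.of(
            TaskState.INIT, TaskState.DROPPED, TaskState.STOPPED,
            TaskState.TIMED_OUT, TaskState.TASK_ERROR);

    /**
     * Decide whether a task should be started again.
     * Assumption: only failed runs (TIMED_OUT, TASK_ERROR) are checked
     * against the attempt limit; INIT, DROPPED and STOPPED tasks are
     * always eligible.
     */
    static boolean shouldRestart(TaskState state, int attemptsSoFar, int maxAttempts) {
        if (!RETRYABLE.contains(state)) {
            return false;
        }
        if (state == TaskState.TIMED_OUT || state == TaskState.TASK_ERROR) {
            return attemptsSoFar < maxAttempts;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldRestart(TaskState.TASK_ERROR, 1, 3)); // true
        System.out.println(shouldRestart(TaskState.TASK_ERROR, 3, 3)); // false
        System.out.println(shouldRestart(TaskState.COMPLETED, 0, 3));  // false
    }
}
```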
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)