[
https://issues.apache.org/jira/browse/HADOOP-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646873#action_12646873
]
Sreekanth Ramakrishnan commented on HADOOP-4513:
------------------------------------------------
After an offline discussion with Hemanth and Vivek, the following is the
proposal for implementing asynchronous initialization of jobs by the capacity
scheduler:
- Modify _CapacityTaskScheduler_ to look only at the Run-queue maintained by
_JobQueueManager_. This queue contains all initialized jobs.
- Modify _JobQueueManager_ to change the semantics of the waiting job queue to a
list of jobs which are waiting to be scheduled. Please note that while a job is
waiting to be scheduled, a job J1 can be in both the running queue and the job
queue at the same time. When the first map or reduce task of the job is
scheduled, the job is removed from the job queue which _JobQueueManager_
maintains.
- Introduce a new poller class, which looks at
_JobQueueManager.getJobs(queue)_ and picks up jobs to initialize for that
queue.
- The following parameters would be used for selecting jobs for eager
initialization:
-- Maximum number of jobs which can be initialized per user. This would be a
new configuration parameter introduced in _capacity_scheduler.xml_.
-- Number of concurrent users supported by the queue, so the initialization
poller would initialize the jobs of ((userlimits/100) + 2 ) users.
- The selected jobs would be passed on to worker threads, which can be assigned
duty of initializing jobs from one or more queues.
- The worker thread maintains separate lists for jobs from different queues,
sorted by priority in the same way as in _JobQueueManager_.
- The worker thread then initializes the jobs in a round robin fashion amongst
the job queues assigned to it, i.e. it initializes the first job from q1 and
then the first job from q2.
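The round robin behaviour of a single worker could be sketched roughly as
below. This is only an illustration of the ordering, not the proposed
implementation: the class name, the method name and the use of plain strings
as job ids are all hypothetical, and each per-queue list is assumed to be
already sorted by priority.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the order in which one worker would initialize jobs
// drawn from the queues assigned to it.
class RoundRobinInitSketch {

    // jobsByQueue maps a queue name to its job ids, already sorted by priority.
    static List<String> initializationOrder(Map<String, List<String>> jobsByQueue) {
        // Copy each list so the caller's data is left untouched.
        Map<String, Deque<String>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : jobsByQueue.entrySet()) {
            pending.put(e.getKey(), new ArrayDeque<>(e.getValue()));
        }

        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            // One pass: take at most one job from every queue.
            for (Deque<String> queue : pending.values()) {
                String job = queue.pollFirst();
                if (job != null) {
                    order.add(job); // stand-in for the actual job initialization
                    progressed = true;
                }
            }
        }
        return order;
    }
}
```

With q1 holding three jobs and q2 holding one, the worker in this sketch
alternates between the queues for as long as both have jobs and then drains
whatever remains in q1.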
Illustration:
Consider a job queue q which can support one concurrent user (i.e.
userlimits = 100%). Three users U1, U2, U3 are submitting jobs in the following
distribution:
Maximum number of jobs to be initialized per user : 2
J1U1,J2U1,J3U1,J4U1,J1U2,J2U2,J3U2,J4U2,J1U3,J2U3,J3U3,J4U3.
Jobs initialized by the Initialization threads would be:
J1U1,J2U1,J1U2,J2U2,J1U3,J2U3.
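The first pass of this illustration could be sketched as below. It is a
minimal sketch under stated assumptions, not the poller itself: the _Job_
class, _selectJobs_ and its parameters are hypothetical names, each of the
three users is assumed to submit four jobs, and the cap of three users is
taken from ((userlimits/100) + 2 ) with userlimits = 100. It covers only the
selection of jobs from a queue that has no initialized jobs yet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of selecting jobs for eager initialization.
class EagerInitSelectionSketch {

    // Hypothetical job handle: a job id plus the submitting user.
    static final class Job {
        final String id;
        final String user;

        Job(String id, String user) {
            this.id = id;
            this.user = user;
        }
    }

    // waiting is assumed to be in queue order (priority, then submit order).
    static List<String> selectJobs(List<Job> waiting, int maxJobsPerUser, int maxUsers) {
        Map<String, Integer> selectedPerUser = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Job job : waiting) {
            Integer count = selectedPerUser.get(job.user);
            if (count == null) {
                if (selectedPerUser.size() >= maxUsers) {
                    continue; // enough distinct users already picked
                }
                count = 0;
            }
            if (count >= maxJobsPerUser) {
                continue; // this user has reached the per-user limit
            }
            selectedPerUser.put(job.user, count + 1);
            selected.add(job.id);
        }
        return selected;
    }
}
```

Called with the twelve jobs above, a per-user limit of 2 and a user cap of 3,
this sketch selects J1U1, J2U1, J1U2, J2U2, J1U3 and J2U3.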
All of these jobs are only initialized, not yet scheduled. Suppose a user U4
now submits a very high priority job and a normal priority job; our job queue
at instant t+1 would then look like:
J1U4,J1U1,J2U1,J3U1,J4U1,J1U2,J2U2,J3U2,J4U2,J1U3,J2U3,J3U3,J4U3,J2U4.
So after the next iteration the poller would have initialized the following:
J1U4,J1U1,J2U1,J1U2,J2U2,J1U3,J2U3.
Please note that U4's second job would not be initialized.
If U1 had submitted the very high priority job instead, U1 would be crossing
the maximum number of jobs which are allowed to be initialized per user.
In the above example, if J1U1 is a job which takes a long time to initialize,
the next job to be initialized would be the next highest priority job, or the
highest priority job if that job was submitted late, as in the above example.
Any thoughts on the above approach?
> Capacity scheduler should initialize tasks asynchronously
> ---------------------------------------------------------
>
> Key: HADOOP-4513
> URL: https://issues.apache.org/jira/browse/HADOOP-4513
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/capacity-sched
> Affects Versions: 0.19.0
> Reporter: Hemanth Yamijala
> Assignee: Sreekanth Ramakrishnan
>
> Currently, the capacity scheduler initializes tasks on demand, as opposed to
> the eager initialization technique used by the default scheduler. This is
> done in order to save JT memory footprint. However, the initialization is
> done in the {{assignTasks}} API which is not a good idea as task
> initialization could be a time consuming operation. This JIRA is to move out
> the initialization outside the {{assignTasks}} API and do it asynchronously.