[ 
https://issues.apache.org/jira/browse/HADOOP-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558734#action_12558734
 ] 

Sanjay Radia commented on HADOOP-2491:
--------------------------------------


Hadoop-2510's analysis that a job should not get a private job-cluster but 
merely an ability to run tasks on nodes that have free capacity makes sense. 
The reservation of a job-cluster is one of the key causes of the low 
utilization in the current system.

BTW, as a few others have noted the scheduling function belongs as a separate 
layer rather than being part of MR. Hence I am commenting on this Jira instead 
of in 2510.

While are I see the simplicity of having one scheduler I think we may not quite 
get away with that. I believe we will need two schedulers. 
The job scheduler's role is to move a submitted job into the run-queue of the 
grid when the grid has sufficient resources to be able complete the job 
satisfactorily.  Once in the run queue, the job generates tasks which are 
scheduled by the task scheduler. Without a job scheduler, too many jobs may 
fight for running tasks and all of them progress too slowly. E.g. consider two 
very large jobs that don't fit in the grid when run together. We need a way for 
the system to automatically schedule these 2 very large jobs at different 
times. E.g. Large-job1 can run with a bunch of smaller jobs and later 
Large-job2 and run with other small jobs. At the very least the job scheduler 
admits jobs into the run queue only if the number of tasks in the task queue is 
small 

BTW I suspect that for map-reduce jobs, we may be able to get away with a very 
simplistic job-scheduler that uses  priorities and takes advantage of the fact 
that  all the mappers are created initially and the reducers follow the 
mappers. Hence if the task queues are priority based and FCFS and furthermore 
reduce tasks are give a higher priority then things may work with a simple job 
scheduler. But more complex jobs may need a sophisticated job scheduler. My 
main point is that the abstraction of a job scheduler is needed. 

Comparison of Job Grid to Service Grid
It is also worth comparing the job grid to a service grid. In the end it would 
be desirable to have a single grid that is used for both jobs and service. 
(Hence the scheduling function should be moved out of MR in a lower layer.)
There are many similarities and some differences between services and jobs and 
their corresponding grids. 
So what is a service and a service grid? A service is like a "persistent job"; 
a service runs "forever" (more accurately it runs till it is removed). 
Acme.com's web site is an example of a service (actually it is probably 3 
services representing the typical 3-tier structure. The service is persistent. 
Most of the interesting service are horizontally scaled - a service has service 
instances (e.g. the instances of the apache servers of acme.com's web service); 
the number of service instances shrink and grow as the load on the service 
shrinks and grows. Hence the service instances are like the tasks. Indeed the 
task scheduler could easily schedule the service instances.

A service has a Service Level Objective (SLO) manager that monitors the load on 
a service and the capacity it needs to support that load; when the capacity is 
not sufficient the SLO manager requests additional resources. Perhaps this is 
like the job manager (one usually keeps the service manager and its SLO-manager 
separate for various reasons.)

Is there a concept of a service scheduler (analogous to job scheduler) in a 
service grid? Note that unlike a job grid, the set of services in a 
service-grid do not change as rapidly as the set of jobs in a job grid. The 
services tend to be fairly persistent while their sets of service instances 
(i.e. their tasks) are very dynamic. There is a capacity planning function in a 
service grid which is a fairly manual function in current generation of service 
grids. The capacity planning function is suppose to ascertain that the SLOs of 
the services can be satisfied. Since one does not want to configure the service 
grid with the max resources for each service one has deal with occasional 
resource shortage. The good news is that often the service mix is such that the 
peak load on all the services do nit occur at the same time. While that helps, 
one still has to deal resource shortage; many systems have service priorities 
and move resources from low priority services to higher priority services 
(while still leaving the low priority services with a min set of resources as 
defined in its SLO specification.) Automated service capacity planning is still 
in early research stage.

Getting back to jobs.
To recap 
•       The proposal of getting rid of the notion of a job cluster and 
dynamically scheduling the tasks of job to the available resources on the 
compute nodes as described in this JIRA makes a lot of sense
•       The notion of task-level scheduling makes a lot of sense. Indeed I 
believe we can use the same task scheduler for service instances (i.e. the 
tasks of a service) scheduling. We should call it a task scheduler - this Jira 
repeatedly points out that the "job scheduler" is really the "task scheduler. 
Just call it that.
•       WE may want the task-scheduler to take action on related tasks of a job 
(e.g. gang scheduling) and hence the job id should be a visible attribute of a 
task.
•       Perhaps we need to introduce a notion of a Job-SLO  - this is something 
that defines the resource requirements of job. Furthermore map-reduce jobs may 
have to be modeled as 2 sub-jobs with different Job-SLOs and priorities.
•       Need a job scheduler in addition to the task scheduler. The job 
scheduler will admit jobs into the grid's run queue when Job-SLOs can be 
satisfactorily met. It is possible that we may get away with a very simplistic 
job scheduler for map-reduce jobs. The abstraction of a job scheduler appears 
to be useful.
•       May need to add support for multi-phase jobs: each a series of 
map-reduce jobs where the output of is fed into another. Such multi-phased jobs 
may need to be scheduled as a collection.
•       The task tracker concept can probably be made more generic.  An 
important function of the task tracker is to monitor the load of their 
respective nodes. It is desirable for tasks to be scheduled not merely on 
number of task slots but on the actual load on the node. 

There are similarities between services and job grids and it is worth exploring 
the similarities to get the right abstractions. As far as jobs and services 
living in the same grid - further investigation is needed.



> generalize the TT / JT servers to handle more generic tasks
> -----------------------------------------------------------
>
>                 Key: HADOOP-2491
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2491
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: eric baldeschwieler
>
> We've been discussing a proposal to generalize the TT / JT servers to handle 
> more generic tasks and move job specific work out of the job tracker and into 
> client code so the whole system is both much more general and has more 
> coherent layering.  The result would look more like condor/pbs like systems 
> (or presumably borg) with map-reduce as a user job.
> Such a system would allow the current map-reduce code to coexist with other 
> work-queuing libraries or maybe even persistent services on the same Hadoop 
> cluster, although that would be a stretch goal.  We'll kick off a thread with 
> some documents soon.
> Our primary goal in going this way would be to get better utilization out of 
> map-reduce clusters and support a richer scheduling model.  The ability to 
> support alternative job frameworks would just be gravy!
> ----
> Putting this in as a place holder.  Hope to get folks talking about this to 
> post some more detail.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to