[ https://issues.apache.org/jira/browse/HADOOP-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558734#action_12558734 ]
Sanjay Radia commented on HADOOP-2491: -------------------------------------- Hadoop-2510's analysis that a job should not get a private job-cluster but merely an ability to run tasks on nodes that have free capacity makes sense. The reservation of a job-cluster is one of the key causes of the low utilization in the current system. BTW, as a few others have noted the scheduling function belongs as a separate layer rather than being part of MR. Hence I am commenting on this Jira instead of in 2510. While are I see the simplicity of having one scheduler I think we may not quite get away with that. I believe we will need two schedulers. The job scheduler's role is to move a submitted job into the run-queue of the grid when the grid has sufficient resources to be able complete the job satisfactorily. Once in the run queue, the job generates tasks which are scheduled by the task scheduler. Without a job scheduler, too many jobs may fight for running tasks and all of them progress too slowly. E.g. consider two very large jobs that don't fit in the grid when run together. We need a way for the system to automatically schedule these 2 very large jobs at different times. E.g. Large-job1 can run with a bunch of smaller jobs and later Large-job2 and run with other small jobs. At the very least the job scheduler admits jobs into the run queue only if the number of tasks in the task queue is small BTW I suspect that for map-reduce jobs, we may be able to get away with a very simplistic job-scheduler that uses priorities and takes advantage of the fact that all the mappers are created initially and the reducers follow the mappers. Hence if the task queues are priority based and FCFS and furthermore reduce tasks are give a higher priority then things may work with a simple job scheduler. But more complex jobs may need a sophisticated job scheduler. My main point is that the abstraction of a job scheduler is needed. Comparison of Job Grid to Service Grid It is also worth comparing the job grid to a service grid. In the end it would be desirable to have a single grid that is used for both jobs and service. (Hence the scheduling function should be moved out of MR in a lower layer.) There are many similarities and some differences between services and jobs and their corresponding grids. So what is a service and a service grid? A service is like a "persistent job"; a service runs "forever" (more accurately it runs till it is removed). Acme.com's web site is an example of a service (actually it is probably 3 services representing the typical 3-tier structure. The service is persistent. Most of the interesting service are horizontally scaled - a service has service instances (e.g. the instances of the apache servers of acme.com's web service); the number of service instances shrink and grow as the load on the service shrinks and grows. Hence the service instances are like the tasks. Indeed the task scheduler could easily schedule the service instances. A service has a Service Level Objective (SLO) manager that monitors the load on a service and the capacity it needs to support that load; when the capacity is not sufficient the SLO manager requests additional resources. Perhaps this is like the job manager (one usually keeps the service manager and its SLO-manager separate for various reasons.) Is there a concept of a service scheduler (analogous to job scheduler) in a service grid? Note that unlike a job grid, the set of services in a service-grid do not change as rapidly as the set of jobs in a job grid. The services tend to be fairly persistent while their sets of service instances (i.e. their tasks) are very dynamic. There is a capacity planning function in a service grid which is a fairly manual function in current generation of service grids. The capacity planning function is suppose to ascertain that the SLOs of the services can be satisfied. Since one does not want to configure the service grid with the max resources for each service one has deal with occasional resource shortage. The good news is that often the service mix is such that the peak load on all the services do nit occur at the same time. While that helps, one still has to deal resource shortage; many systems have service priorities and move resources from low priority services to higher priority services (while still leaving the low priority services with a min set of resources as defined in its SLO specification.) Automated service capacity planning is still in early research stage. Getting back to jobs. To recap • The proposal of getting rid of the notion of a job cluster and dynamically scheduling the tasks of job to the available resources on the compute nodes as described in this JIRA makes a lot of sense • The notion of task-level scheduling makes a lot of sense. Indeed I believe we can use the same task scheduler for service instances (i.e. the tasks of a service) scheduling. We should call it a task scheduler - this Jira repeatedly points out that the "job scheduler" is really the "task scheduler. Just call it that. • WE may want the task-scheduler to take action on related tasks of a job (e.g. gang scheduling) and hence the job id should be a visible attribute of a task. • Perhaps we need to introduce a notion of a Job-SLO - this is something that defines the resource requirements of job. Furthermore map-reduce jobs may have to be modeled as 2 sub-jobs with different Job-SLOs and priorities. • Need a job scheduler in addition to the task scheduler. The job scheduler will admit jobs into the grid's run queue when Job-SLOs can be satisfactorily met. It is possible that we may get away with a very simplistic job scheduler for map-reduce jobs. The abstraction of a job scheduler appears to be useful. • May need to add support for multi-phase jobs: each a series of map-reduce jobs where the output of is fed into another. Such multi-phased jobs may need to be scheduled as a collection. • The task tracker concept can probably be made more generic. An important function of the task tracker is to monitor the load of their respective nodes. It is desirable for tasks to be scheduled not merely on number of task slots but on the actual load on the node. There are similarities between services and job grids and it is worth exploring the similarities to get the right abstractions. As far as jobs and services living in the same grid - further investigation is needed. > generalize the TT / JT servers to handle more generic tasks > ----------------------------------------------------------- > > Key: HADOOP-2491 > URL: https://issues.apache.org/jira/browse/HADOOP-2491 > Project: Hadoop > Issue Type: Improvement > Components: mapred > Reporter: eric baldeschwieler > > We've been discussing a proposal to generalize the TT / JT servers to handle > more generic tasks and move job specific work out of the job tracker and into > client code so the whole system is both much more general and has more > coherent layering. The result would look more like condor/pbs like systems > (or presumably borg) with map-reduce as a user job. > Such a system would allow the current map-reduce code to coexist with other > work-queuing libraries or maybe even persistent services on the same Hadoop > cluster, although that would be a stretch goal. We'll kick off a thread with > some documents soon. > Our primary goal in going this way would be to get better utilization out of > map-reduce clusters and support a richer scheduling model. The ability to > support alternative job frameworks would just be gravy! > ---- > Putting this in as a place holder. Hope to get folks talking about this to > post some more detail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.