[ https://issues.apache.org/jira/browse/HADOOP-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsh J resolved HADOOP-3421. ----------------------------- Resolution: Duplicate Resolving as dupe of MAPREDUCE-279. Although, this is much better doc-wise and serves as a good reference. Please reopen if I missed something that the other didn't provide (and was the goal here). > Requirements for a Resource Manager for Hadoop > ---------------------------------------------- > > Key: HADOOP-3421 > URL: https://issues.apache.org/jira/browse/HADOOP-3421 > Project: Hadoop Common > Issue Type: New Feature > Reporter: Vivek Ratan > > This is a proposal to extend the scheduling functionality of Hadoop to allow > sharing of large clusters without the use of HOD. We're suffering from > performance issues with HOD and not finding it the right model for running > jobs. We have concluded that a native Hadoop Resource Manager would be more > useful to many people if it supported the features we need for sharing > clusters across large groups and organizations. > Below are the key requirements for a Resource Manager for Hadoop. First, some > terminology used in this writeup: > * *RM*: Resource Manager. What we're building. > * *MR*: Map Reduce. > * A *job* is an MR job for now, but can be any request. Jobs are submitted by > users to the Grid. MR jobs are made up of units of computation called *tasks*. > * A grid has a variety of *resources* of different *capacities* that are > allocated to tasks. For the the early version of the grid, the only resource > considered is a Map or Reduce slot, which can execute a task. Each slot can > run one or more tasks. Later versions may look at resources such as local > temporary storage or CPUs. > * *V1*: version 1. Some features are simplified for V1. > h3. Orgs, queues, users, jobs > Organizations (*Orgs*) are distinct entities for administration, > configuration, billing and reporting purposes. *Users* belong to Orgs. Orgs > have *queues* of jobs, where a queue represents a collection of jobs that > share some scheduling criteria. > * *1.1.* For V1, each queue will belong to one Org and each Org will have > one queue. > * *1.2.* Jobs are submitted to queues. A single job can be submitted to > only one queue. It follows that a job will have a user and an Org associated > with it. > * *1.3.* A user can belong to multiple Orgs and can potentially submit > jobs to multiple queues. > * *1.4.* Orgs are guaranteed a fraction of the capacity of the grid (their > 'guaranteed capacity') in the sense that a certain capacity of resources will > be at their disposal. All jobs submitted to the queues of an Org will have > access to the capacity guaranteed to the Org. > ** Note: it is expected that the sum of the guaranteed capacity of each > Org should equal the resources in the Grid. If the sum is lower, some > resources will not be used. If the sum is higher, the RM cannot maintain > guarantees for all Orgs. > * *1.5.* At any given time, free resources can be allocated to any Org > beyond their guaranteed capacity. For example this may be in the proportion > of guaranteed capacities of various Orgs or some other way. However, these > excess allocated resources can be reclaimed and made available to another Org > in order to meet its capacity guarantee. > * *1.6.* N minutes after an org reclaims resources, it should have all its > reserved capacity available. Put another way, the system will guarantee that > excess resources taken from an Org will be restored to it within N minutes of > its need for them. > * *1.7.* Queues have access control. Queues can specify which users are > (not) allowed to submit jobs to it. A user's job submission will be rejected > if the user does not have access rights to the queue. > h3. Job capacity > * *2.1.* Users will just submit jobs to the Grid. They do not need to > specify the capacity required for their jobs (i.e. how many parallel tasks > the job needs). [Most MR jobs are elastic and do not require a fixed number > of parallel tasks to run - they can run with as little or as much task > parallelism as they can get. This amount of task parallelism is usually > limited by the number of mappers required (which is computed by the system > and not by the user) or the amount of free resources available in the grid. > In most cases, the user wants to just submit a job and let the system take > care of utilizing as many or as little resources as it can.] > h3. Priorities > * *3.1.* Jobs can optionally have priorities associated with them. For V1, > we support the same set of priorities available to MR jobs today. > * *3.2.* Queues can optionally support priorities for jobs. By default, a > queue does not support priorities, in which case it will ignore (with a > warning) any priority levels specified by jobs submitted to it. If a queue > does support priorities, it will have a default priority associated with it, > which is assigned to jobs that don't have priorities. Reqs 3.1 and 3.2 > together mean this: if a queue supports priorities, then a job is assigned > the default priority if it doesn't have one specified, else the job's > specified priority is used. If a queue does not support priorities, then it > ignores priorities specified for jobs. > * *3.3.* Within a queue, jobs with higher priority will have access to the > queue's resources before jobs with lower priority. However, once a job is > running, it will not be preempted (i.e., stopped and restarted) for a higher > priority job. What this also means is that comparison of priorities makes > sense within queues, and not across them. > h3. Fairness/limits > * *4.1.* In order to prevent one or more users from monopolizing its > resources, each queue enforces a limit on the percentage of resources > allocated to a user at any given time, if there is competition for them. This > user limit can vary between a minimum and maximum value. For V1, all users > have the same limit, whose maximum value is dictated by the number of users > who have submitted jobs, and whose minimum value is a pre-configured value > UL-MIN. For example, suppose UL-MIN is 25. If two users have submitted jobs > to a queue, no single user can use more than 50% of the queue resources. If a > third user submits a job, no single user can use more than 33% of the queue > resources. With 4 or more users, no user can use more than 25% of the queue's > resources. > ** Limits apply to newer scheduling, i.e., running jobs or tasks will > not be preempted. > ** The value of UL-MIN can be set differently per Org. > h3. Job queue interaction > * *5.1.* Interaction with the Job queue should be through a command line > interface and a web-based GUI. > * *5.2.* All queues are visible to all users. The Web UI will provide a > single-page view of all queues. > * *5.3.* Users should be able to delete their queued jobs at any time. > * *5.4.* Users should be able to see capacity statistics for various Orgs > (what is the capacity allocated, how much is being used, etc.) > * *5.5.* Existing utilities, such as the *hadoop job -list* command, > should be enhanced to show additional attributes that are relevant. For e.g. > which queue is associated with the job. > h3. Accounting > * *6.1.* The RM must provide accounting information in a manner that can > be easily consumed by external plug-ins or utilities to integrate with 3rd > party accounting systems. The accounting information should comprise of the > following information: > ** Username running the Hadoop job, > ** job id, > ** job name, > ** queue to which job was submitted and organization owning the queue, > ** number of resource units (for e.g. slots) used > ** number of maps / reduces, > ** timings - time of entry into the queue, start and end times of the > job, perhaps total node hours, > ** status of the job - success, failed, killed, etc. > * *6.2.* To assist deployments which do not require accounting, it should > be possible to turn off this feature. > h3. Availability > * *7.1* Job state needs to be persisted (RM restarts should not cause jobs > to die) > h3. Scalability > * *8.1.* Scale to 3k+ nodes > * *8.2.* Scale to handle 1k+ large submitted jobs, each with 100k+ tasks > h3. Configuration > * *9.1.* The system must provide a mechanism to create and delete > organizations, and queues within the organizations. It must also provide a > mechanism to configure various properties of these objects. > * *9.2.* Only users with administrative privileges can perform operations > of managing and configuring these objects in the system. > * *9.3.* Configuration changes must be effective in the RM without > requiring its restart. They must be effective in a reasonable amount of time > since the modification is made. > * *9.4.* For most of the configurations, it appears that there can be > values at various levels - Grid, organization, queue, user and job. For e.g. > there can be a default value for the resource quota per user at a Grid level, > which can be overridden at an org level, and so on. There must be an easy way > to express these configurations in this hierarchical fashion. Also, values at > a broader level can be overridden by values at a more narrow level. > * *9.5.* There must be appropriate default objects and default values for > their configuration. This is to help deployments that do not need a complex > scheduling setup. > h3. Logging Enhancements > * *10.1.* For purposes of debugging, the Hadoop web UI should provide a > facility to see details of all jobs. While this is mostly supported today, > any changes to meet other requirements, such as scalability, must not affect > this feature. Also, it must be possible to view task logs from Job history UI > (see HADOOP:2165) > * *10.2.* The system must log all relevant events about a job vis-a-vis > scheduling. Particularly, changes in state of a job (queued -> scheduled -> > completed | killed), and events which caused these changes must be logged. > * *10.3.* The system should be able to provide relevant, explanatory > information about the status of job to give feedback to users. This could be > a diagnostic string such as why the job is queued or why it failed. (For e.g. > lack of sufficient resources - how many were asked, how many are available, > exceeding user limits, etc). This information must be available to users, as > well as in the logs for debugging purposes. It should also be possible to > programmatically get this information. > * *10.4.* The host which submitted the job should be part of log messages. > This would assist in debugging. > h3. Security Enhancements > * *11.1.* The RM should provide a mechanism for controlling who can submit > jobs to which queue. This could be done using an ACL mechanism that consists > of an ordered whitelist and blacklist of users. The order can determine which > ACL would apply in case of conflicts. > * *11.2.* The system must provide a mechanism to list users who have > administrative control. Only users in this list should be allowed to modify > configuration related to the RM, like configuration, setting up objects, etc. > * *11.3.* The system should be able to schedule tasks running on behalf of > multiple users concurrently on the same host in a secure manner. > Specifically, this should not require any insecure configuration, such as > requiring 0777 permissions on directories etc. > * *11.4.* The system must follow the security mechanisms being implemented > for Hadoop (HADOOP:1701 and friends). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira