[ 
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557066#action_12557066
 ] 

Doug Cutting commented on HADOOP-2510:
--------------------------------------

The stated goals of this design are to improve things when running mapreduce on 
a subset of the nodes of a cluster, when HDFS is run on all nodes.  The current 
approach is to run new mapreduce daemons (jobtracker and tasktrackers) for the 
subset.  The problems are that this does not utilize nodes as fully as they 
could be (e.g., during the tail of a job) and it inhibits data locality 
optimizations.

The proposed solution is to split the jobtracker daemon in two, one shared, 
long-running daemon, and a per job daemon.  My concern with this approach is 
that adding a new kind of daemon considerably complicates things.  New classes 
of daemons exponentially increase the number of failure modes that must be 
tested and debugged.  This could be warranted if it permitted greater sharing 
of functionality between systems, reducing the amount of functionality that we 
must maintain.  For example, we could add a general node allocation system, and 
built map-reduce on top of this.  But for that to be a convincingly independent 
layer, we'd need to demonstrate that we can build other, non-mapreduce systems 
on it, e.g., perhaps hdfs, but this proposal doesn't seem to offer that.

I propose that the stated problems can be more simply and directly solved 
without adding a new daemon, but with the existing integrated system.  We can 
add a job parameter naming the maximum number of nodes that will be used 
simultaneously.  Then a single jobtracker for the entire cluster can schedule 
tasks for multiple jobs at a time, each running on different subsets of nodes.  
A cluster of 1000 nodes might be configured to limit jobs to 200 nodes each.  
As jobs are winding down and no longer use all 200 nodes, the next job can use 
those nodes, improving utilization, the first stated goal of this issue.  The 
entire cluster is available to the jobtracker for scheduling, so that it can 
arrange to place tasks on nodes where their data is local, addressing the 
second stated goal of this issue.

Splitting the jobtracker sounds like it would simplify things, since it would 
result in two simpler services, but distributed systems are more impacted by 
the number of kinds of services than by the complexity of a single service.  
Thus perhaps the jobtracker could be better structured internally, to separate 
concerns within its implementation, but I do not yet see an argument for moving 
them to separate services.  That seems like it will only make things less 
reliable: the same logic running in two daemons that could run equivalently in 
a single daemon.

> Map-Reduce 2.0
> --------------
>
>                 Key: HADOOP-2510
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2510
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Arun C Murthy
>
> We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
> provisioning/scheduling mechanism. 
> With HoD the user uses a self-service system to ask-for a set of nodes. HoD 
> allocates these from a global pool and also provisions a private Map-Reduce 
> cluster for the user. She then runs her jobs and shuts the cluster down via 
> HoD when done. All user-private clusters use the same humongous, static HDFS 
> (e.g. 2k node HDFS). 
> More details about HoD are available here: HADOOP-1301.
> ----
> h3. Motivation
> The current deployment (Hadoop + HoD) has a couple of implications:
>  * _Non-optimal Cluster Utilization_
>    1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
> could be *idle* for atleast a while before being detected and shut-down.
>    2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
> much-smaller no. of reduces; with maps being light and quick and reduces 
> being i/o heavy and longer-running. Users typically allocate clusters 
> depending on the no. of maps (i.e. input size) which leads to the scenario 
> where all the maps are done (idle nodes in the cluster) and the few reduces 
> are chugging along. Right now, we do not have the ability to shrink the 
> HoD'ed Map-Reduce clusters which would alleviate this issue. 
>  * _Impact on data-locality_
> With the current setup of a static, large HDFS and much smaller (5/10/20/50 
> node) clusters there is a good chance of losing one of Map-Reduce's primary 
> features: ability to execute tasks on the datanodes where the input splits 
> are located. In fact, we have seen the data-local tasks go down to 20-25 
> percent in the GridMix benchmarks, from the 95-98 percent we see on the 
> randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a 
> synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware 
> Map-Reduce) helps significantly here.
> ----
> Primarily, the notion of *job-level scheduling* leading to private clusers, 
> as opposed to *task-level scheduling*, is a good peg to hang-on the majority 
> of the blame.
> Keeping the above factors in mind, here are some thoughts on how to 
> re-structure Hadoop Map-Reduce to solve some of these issues.
> ----
> h3. State of the Art
> As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD 
> for a bit) does provide task-level scheduling; however as it exists today, 
> it's scalability to tens-of-thousands of user-jobs, per-week, is in question.
> Lets review it's current architecture and main components:
>  * JobTracker: It does both *task-scheduling* and *task-monitoring* 
> (tasktrackers send task-statuses via periodic heartbeats), which implies it 
> is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
> framework i.e. its failure implies that all the jobs in the system fail. This 
> means a static, large Map-Reduce cluster is fairly susceptible and a definite 
> suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
> above drawbacks.
>  * TaskTracker: The slave in the system which executes one task at-a-time 
> under directions from the JobTracker.
>  * JobClient: The per-job client which just submits the job and polls the 
> JobTracker for status. 
> ----
> h3. Proposal - Map-Reduce 2.0 
> The primary idea is to move to task-level scheduling and static Map-Reduce 
> clusters (so as to maintain the same storage cluster and compute cluster 
> paradigm) as a way to directly tackle the two main issues illustrated above. 
> Clearly, we will have to get around the existing problems, especially w.r.t. 
> scalability and reliability.
> The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, 
> static cluster. 
> Here is an overview of how its main components would look like:
>  * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. 
> Lets call this the *JobScheduler* henceforth. Clearly (data-locality aware) 
> Maui/Moab are  candidates for being the scheduler, in which case, the 
> JobScheduler is just a thin wrapper around them. 
>  * TaskTracker: These stay as before, without some minor changes as 
> illustrated later in the piece.
>  * JobClient: Fatten up the JobClient my putting a lot more intelligence into 
> it. Enhance it to talk to the JobTracker to ask for available TaskTrackers 
> and then contact them to schedule and monitor the tasks. So we'll have lots 
> of per-job clients talking to the JobScheduler and the relevant TaskTrackers 
> for their respective jobs, a big change from today. Lets call this the 
> *JobManager* henceforth. 
> A broad sketch of how things would work: 
> h4. Deployment
> There is a single, static, large Map-Reduce cluster, and no per-job clusters.
> Essentially there is one global JobScheduler with thousands of independent 
> TaskTrackers, each running on one node.
> As mentioned previously, the JobScheduler is a pure task-scheduler. When 
> contacted by per-job JobManagers querying for TaskTrackers to run their tasks 
> on, the JobTracker takes into the account the job priority, data-placements 
> (HDFS blocks), current-load/capacity of the TaskTrackers and gives the 
> JobManager a free slot for the task(s) in question, if available.
> Each TaskTracker periodically updates the master JobScheduler with 
> information about the currently running tasks and available free-slots. It 
> waits for the per-job JobManager to contact it for free-slots (which abide 
> the JobScheduler's directives) and status for currently-running tasks (of 
> course, the JobManager knows exactly which TaskTrackers it needs to talk to).
> The fact that the JobScheduler is no longer doing the heavy-lifting of 
> monitoring tasks (like the current JobTracker), and hence the jobs, is the 
> key differentiator, which is why it should be very light-weight. (Thus, it is 
> even conceivable to imagine a hot-backup of the JobScheduler, topic for 
> another discussion.)
> h4. Job Execution
> Here is how the job-execution work-flow looks like:
>     * User submits a job,
>     * The JobClient, as today, validates inputs, computes the input splits 
> etc.
>     * Rather than submit the job to the JobTracker which then runs it, the 
> JobClient now dons the role of the JobManager as described above (of course 
> they could be two independent processes working in conjunction with the 
> other... ). The JobManager pro-actively works with the JobScheduler and the 
> TaskTrackers to execute the job. While there are more tasks to run for the 
> still-running job, it contacts the JobScheduler to get 'n' free slots and 
> schedules m tasks (m <= n) on the given TaskTrackers (slots). The JobManager 
> also monitors the tasks by contacting the relevant TaskTrackers (it knows 
> which of the TaskTrackers are running its tasks). 
> h4. Brownie Points
>  *  With Map-Reduce v2.0, we get reliability/scalability of the current 
> (Map-Reduce + HoD) architecture.
>  * We get elastic jobs for free since there is no concept of private clusters 
> and clearly JobManagers do not need to hold on to the map-nodes when they are 
> done.
>  * We do get data-locality across all jobs, big or small, since there are no 
> off-limit DataNodes (i.e. DataNodes outside the private cluster) for a 
> Map-Reduce cluster, as today.
>  * From an architectural standpoint, each component in the system (sans the 
> global scheduler) is nicely independent and impervious of the other:
>   ** A JobManager is responsible for one and only one job, loss of a 
> JobManager affects only one job.
>   ** A TaskTracker manages only one node, it's loss affects only one node in 
> the cluster. 
>   ** No user-code runs in the JobScheduler since it's a pure scheduler.
>  * We can run all of the user-code (input/output formats, split calculation, 
> task-output promotion etc.) from the JobManager since it is, by definition, 
> the user-client. 
> h4. Points to Ponder
>  * Given that the JobScheduler, is very light-weight, could we have a 
> hot-backup for HA?
>  * Discuss the notion of a rack-level aggregator of TaskTracker statuses i.e. 
> rather than have every TaskTracker update the JobScheduler, a rack-level 
> aggregator could achieve the same?
>  * We could have the notion of a JobManager being the proxy process running 
> inside the cluster for the JobClient (the job-submitting program which is 
> running outside the colo e.g. user's dev box) ... in fact we can think of the 
> JobManager being *another kind of task* which needs to be scheduled to run at 
> a TaskTracker. 
>  * Task Isolation via separate vms (vmware/xen) rather than just separate 
> jvms?
> h4. How do we get to Map-Reduce 2.0?
> At the risk of sounding hopelessly optimistic, we probably do not have to 
> work too much to get here.
>  * Clearly the main changes come in the JobTracker/JobClient where we _move_ 
> the pieces which monitor the job's tasks' progress into the 
> JobScheduler/JobManager.
>  * We also need to enhance the JobClient (as the JobManager) to get it to 
> talk to the JobTracker (JobScheduler) to query for the empty slots, which 
> might not be available!
>  * Then we need to add RPCs to get the JobClient (JobManager) to talk to the 
> given TaskTrackers to get them to run the tasks, thus reversing the direction 
> of current RPCs needed to start a task (now the TaskTracker asks the 
> JobTracker for tasks to run); we also need new RPCs for the JobClient 
> (JobManager) to talk to the TaskTracker to query it's tasks' statuses.
>  * We leave the current heartbeat mechanism from the TaskTracker to the 
> JobTracker (JobScheduler) as-is, sans the task-statuses. 
> h4. Glossary
>  * JobScheduler - The global, task-scheduler which is today's JobTracker 
> minus the code for tracking/monitoring jobs and their tasks. A pure scheduler.
>  * JobManager - The per-job manager which is wholly responsible for working 
> with the JobScheduler and TaskTrackers to schedule it's tasks and track their 
> progress till job-completion (success/failure). Simplistically it is the 
> current JobClient plus the enhancements to enable it to talk to the 
> JobScheduler and TaskTrackers for running/monitoring the tasks. 
> ----
> h3. Tickets for the Gravy-Train ride
> Eric has started a discussion about generalizing Hadoop to support non-MR 
> tasks, a discussion which has surfaced a few times on our lists, at 
> HADOOP-2491. 
> He notes:
> {quote}
> Our primary goal in going this way would be to get better utilization out of 
> map-reduce clusters and support a richer scheduling model. The ability to 
> support alternative job frameworks would just be gravy!
> Putting this in as a place holder. Hope to get folks talking about this to 
> post some more detail.
> {quote}
> This is the start of the path to the promised gravy-land. *smile*
> We believe Map-Reduce 2.0 is a good start in moving most (if not all) of the 
> Map-Reduce specific code into the user-clients (i.e. JobManager) and taking a 
> shot at generalizing the JobTracker (as the JobScheduler) and the TaskTracker 
> to handle more generic tasks via different (smarter/dumber) user-clients.
> ----
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to