[
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557765#action_12557765
]
hammer edited comment on HADOOP-2510 at 1/10/08 11:47 AM:
---------------------------------------------------------------------
The separation of functionality outlined above (DFS, MR, Cluster Scheduler)
would be fantastic. I certainly respect Doug's experience with large
distributed systems, but it seems the logic required to run multiple MapReduce
jobs is different enough from running a single MapReduce job that separate
daemons would provide a much cleaner implementation.
was (Author: hammer):
The separation of daemons outlined above (DFS, MR, Cluster Scheduler) would
be fantastic. I certainly respect Doug's experience with large distributed
systems but it seems the logic required to run multiple MapReduce jobs is
different enough from running a single MapReduce job that separate daemons
would provide a much cleaner implementation.
> Map-Reduce 2.0
> --------------
>
> Key: HADOOP-2510
> URL: https://issues.apache.org/jira/browse/HADOOP-2510
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Arun C Murthy
>
> We, at Yahoo!, have been using Hadoop-On-Demand as the resource
> provisioning/scheduling mechanism.
> With HoD the user uses a self-service system to ask for a set of nodes. HoD
> allocates these from a global pool and also provisions a private Map-Reduce
> cluster for the user. She then runs her jobs and shuts the cluster down via
> HoD when done. All user-private clusters use the same humongous, static HDFS
> (e.g. 2k node HDFS).
> More details about HoD are available here: HADOOP-1301.
> ----
> h3. Motivation
> The current deployment (Hadoop + HoD) has a couple of implications:
> * _Non-optimal Cluster Utilization_
> 1. Job-private Map-Reduce clusters imply that the user-cluster potentially
> could be *idle* for at least a while before being detected and shut down.
> 2. Elastic Jobs: Map-Reduce jobs typically have lots of maps with a much
> smaller number of reduces; the maps are light and quick while the reduces
> are i/o heavy and longer-running. Users typically allocate clusters
> depending on the number of maps (i.e. input size), which leads to the
> scenario where all the maps are done (idle nodes in the cluster) while the
> few reduces are chugging along. Right now, we do not have the ability to
> shrink the HoD'ed Map-Reduce clusters, which would alleviate this issue.
> * _Impact on data-locality_
> With the current setup of a static, large HDFS and much smaller (5/10/20/50
> node) clusters there is a good chance of losing one of Map-Reduce's primary
> features: ability to execute tasks on the datanodes where the input splits
> are located. In fact, we have seen the data-local tasks go down to 20-25
> percent in the GridMix benchmarks, from the 95-98 percent we see on the
> randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a
> synthetic benchmark, but still). That said, HADOOP-1985 (rack-aware
> Map-Reduce) helps significantly here.
> ----
> Primarily, the notion of *job-level scheduling* leading to private clusters,
> as opposed to *task-level scheduling*, deserves the majority of the blame.
> Keeping the above factors in mind, here are some thoughts on how to
> re-structure Hadoop Map-Reduce to solve some of these issues.
> ----
> h3. State of the Art
> As it exists today, a large, static Hadoop Map-Reduce cluster (forget HoD
> for a bit) does provide task-level scheduling; however, its scalability to
> tens of thousands of user jobs per week is in question.
> Let's review its current architecture and main components:
> * JobTracker: It does both *task-scheduling* and *task-monitoring*
> (tasktrackers send task-statuses via periodic heartbeats), which implies it
> is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce
> framework i.e. its failure implies that all the jobs in the system fail. This
> means a static, large Map-Reduce cluster is fairly susceptible and a definite
> suspect. Clearly HoD solves this by having per-job clusters, albeit with the
> above drawbacks.
> * TaskTracker: The slave in the system which executes one task at a time
> under directions from the JobTracker.
> * JobClient: The per-job client which just submits the job and polls the
> JobTracker for status.
> ----
> h3. Proposal - Map-Reduce 2.0
> The primary idea is to move to task-level scheduling and static Map-Reduce
> clusters (so as to maintain the same storage cluster and compute cluster
> paradigm) as a way to directly tackle the two main issues illustrated above.
> Clearly, we will have to get around the existing problems, especially w.r.t.
> scalability and reliability.
> The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large,
> static cluster.
> Here is an overview of how its main components would look:
> * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one.
> Let's call this the *JobScheduler* henceforth. Clearly, (data-locality aware)
> Maui/Moab are candidates for being the scheduler, in which case the
> JobScheduler is just a thin wrapper around them.
> * TaskTracker: These stay largely as before, with some minor changes as
> illustrated later in the piece.
> * JobClient: Fatten up the JobClient by putting a lot more intelligence into
> it. Enhance it to talk to the JobTracker to ask for available TaskTrackers
> and then contact them to schedule and monitor the tasks. So we'll have lots
> of per-job clients talking to the JobScheduler and the relevant TaskTrackers
> for their respective jobs, a big change from today. Let's call this the
> *JobManager* henceforth.
> A broad sketch of how things would work:
> h4. Deployment
> There is a single, static, large Map-Reduce cluster, and no per-job clusters.
> Essentially there is one global JobScheduler with thousands of independent
> TaskTrackers, each running on one node.
> As mentioned previously, the JobScheduler is a pure task-scheduler. When
> contacted by per-job JobManagers querying for TaskTrackers to run their tasks
> on, the JobScheduler takes into account the job priority, data placement
> (HDFS blocks), and the current load/capacity of the TaskTrackers, and gives
> the JobManager a free slot for the task(s) in question, if available.
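> The slot-granting step described above can be sketched roughly as follows.
> This is a minimal sketch, assuming each TaskTracker reports a host name and
> a free-slot count; the class and method names (TaskTrackerInfo,
> JobSchedulerSketch, allocateSlots) are illustrative only, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the JobScheduler's slot-allocation step.
// All names here are illustrative, not actual Hadoop APIs.
class TaskTrackerInfo {
    final String host;
    int freeSlots;
    TaskTrackerInfo(String host, int freeSlots) {
        this.host = host;
        this.freeSlots = freeSlots;
    }
}

class JobSchedulerSketch {
    // Grant up to n slots, preferring trackers that hold the job's input blocks.
    static List<String> allocateSlots(List<TaskTrackerInfo> trackers,
                                      Set<String> preferredHosts, int n) {
        List<String> granted = new ArrayList<>();
        // First pass: data-local trackers.
        for (TaskTrackerInfo t : trackers) {
            while (granted.size() < n && t.freeSlots > 0
                    && preferredHosts.contains(t.host)) {
                t.freeSlots--;
                granted.add(t.host);
            }
        }
        // Second pass: any remaining tracker with free capacity.
        for (TaskTrackerInfo t : trackers) {
            while (granted.size() < n && t.freeSlots > 0) {
                t.freeSlots--;
                granted.add(t.host);
            }
        }
        return granted;
    }
}
```

> Data-local trackers are drained first, mirroring the data-placement
> preference above; a real scheduler (priorities, Maui/Moab integration)
> would of course be far richer.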
> Each TaskTracker periodically updates the master JobScheduler with
> information about the currently running tasks and available free slots. It
> waits for the per-job JobManagers to contact it for free slots (which abide
> by the JobScheduler's directives) and for the status of currently-running
> tasks (of course, the JobManager knows exactly which TaskTrackers it needs
> to talk to).
> The fact that the JobScheduler no longer does the heavy lifting of
> monitoring tasks (and hence jobs) like the current JobTracker is the key
> differentiator, and it is why the JobScheduler can be very light-weight.
> (Thus, it is even conceivable to imagine a hot-backup of the JobScheduler,
> a topic for another discussion.)
> h4. Job Execution
> Here is how the job-execution work-flow looks:
> * User submits a job,
> * The JobClient, as today, validates inputs, computes the input splits
> etc.
> * Rather than submit the job to the JobTracker which then runs it, the
> JobClient now dons the role of the JobManager as described above (of course,
> they could be two independent processes working in conjunction with each
> other... ). The JobManager pro-actively works with the JobScheduler and the
> TaskTrackers to execute the job. While there are more tasks to run for the
> still-running job, it contacts the JobScheduler to get 'n' free slots and
> schedules m tasks (m <= n) on the given TaskTrackers (slots). The JobManager
> also monitors the tasks by contacting the relevant TaskTrackers (it knows
> which of the TaskTrackers are running its tasks).
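> The loop described above (request 'n' slots, schedule m <= n tasks, repeat)
> can be sketched as follows; all names here are hypothetical, and
> requestSlots stands in for the RPC to the JobScheduler:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch of the JobManager's scheduling loop described above.
// Names are illustrative; requestSlots stands in for the JobScheduler RPC.
class JobManagerSketch {
    // Returns the number of scheduling rounds needed to place every task,
    // given a function that grants slot hostnames for a request of size n.
    static int runToCompletion(int totalTasks, int batchSize,
                               IntFunction<List<String>> requestSlots) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int i = 0; i < totalTasks; i++) pending.add(i);
        int rounds = 0;
        while (!pending.isEmpty()) {
            int n = Math.min(batchSize, pending.size());
            List<String> slots = requestSlots.apply(n); // ask for n free slots
            int m = Math.min(slots.size(), n);          // schedule m <= n tasks
            for (int i = 0; i < m; i++) {
                int task = pending.remove();
                // In the real flow we would now contact the TaskTracker at
                // slots.get(i) to launch `task`, then poll it for status.
            }
            rounds++;
            // (A real JobManager would back off when no slots are granted.)
        }
        return rounds;
    }
}
```

> The point of the sketch is simply that the per-job client, not a central
> daemon, drives task placement and tracking.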
> h4. Brownie Points
> * With Map-Reduce v2.0, we get reliability/scalability of the current
> (Map-Reduce + HoD) architecture.
> * We get elastic jobs for free since there is no concept of private clusters
> and clearly JobManagers do not need to hold on to the map-nodes when they are
> done.
> * We do get data-locality across all jobs, big or small, since there are no
> off-limit DataNodes (i.e. DataNodes outside the private cluster) for a
> Map-Reduce cluster, as today.
> * From an architectural standpoint, each component in the system (sans the
> global scheduler) is nicely independent of the others:
> ** A JobManager is responsible for one and only one job; the loss of a
> JobManager affects only one job.
> ** A TaskTracker manages only one node; its loss affects only one node in
> the cluster.
> ** No user-code runs in the JobScheduler since it's a pure scheduler.
> * We can run all of the user-code (input/output formats, split calculation,
> task-output promotion etc.) from the JobManager since it is, by definition,
> the user-client.
> h4. Points to Ponder
> * Given that the JobScheduler is very light-weight, could we have a
> hot-backup for HA?
> * Discuss the notion of a rack-level aggregator of TaskTracker statuses i.e.
> rather than have every TaskTracker update the JobScheduler, a rack-level
> aggregator could achieve the same?
> * We could have the notion of a JobManager being the proxy process running
> inside the cluster for the JobClient (the job-submitting program which is
> running outside the colo e.g. user's dev box) ... in fact we can think of the
> JobManager being *another kind of task* which needs to be scheduled to run at
> a TaskTracker.
> * Task isolation via separate VMs (VMware/Xen) rather than just separate
> JVMs?
> h4. How do we get to Map-Reduce 2.0?
> At the risk of sounding hopelessly optimistic, we probably do not have to
> work too much to get here.
> * Clearly the main changes come in the JobTracker/JobClient where we _move_
> the pieces which monitor the job's tasks' progress into the
> JobScheduler/JobManager.
> * We also need to enhance the JobClient (as the JobManager) to get it to
> talk to the JobTracker (JobScheduler) to query for the empty slots, which
> might not be available!
> * Then we need to add RPCs to get the JobClient (JobManager) to talk to the
> given TaskTrackers to get them to run the tasks, thus reversing the direction
> of current RPCs needed to start a task (today the TaskTracker asks the
> JobTracker for tasks to run); we also need new RPCs for the JobClient
> (JobManager) to talk to the TaskTracker to query its tasks' statuses.
> * We leave the current heartbeat mechanism from the TaskTracker to the
> JobTracker (JobScheduler) as-is, sans the task-statuses.
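> The reversed-RPC changes above might look something like the following
> interface sketches. Every name here (JobSchedulerProtocol,
> TaskTrackerProtocol, HeartbeatProtocol, InMemoryScheduler) is a hypothetical
> illustration for discussion, not an actual Hadoop protocol; the in-memory
> class is just a toy stand-in to make the shape concrete:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface sketches for the reversed RPC direction; none of
// these names exist in Hadoop -- they are assumptions for discussion only.

// JobManager -> JobScheduler: ask for free slots (may be granted fewer).
interface JobSchedulerProtocol {
    String[] getFreeSlots(String jobId, int numSlots);
}

// JobManager -> TaskTracker: launch a task and poll its status, reversing
// today's TaskTracker-pulls-from-JobTracker flow.
interface TaskTrackerProtocol {
    boolean launchTask(String jobId, String taskId);
    String getTaskStatus(String taskId);
}

// TaskTracker -> JobScheduler heartbeat, now carrying only free-slot counts
// (task statuses move to the per-job JobManager path).
interface HeartbeatProtocol {
    void heartbeat(String trackerHost, int freeSlots);
}

// Toy in-memory stand-in showing how the two scheduler-side protocols compose.
class InMemoryScheduler implements HeartbeatProtocol, JobSchedulerProtocol {
    private final Map<String, Integer> free = new HashMap<>();

    public void heartbeat(String trackerHost, int freeSlots) {
        free.put(trackerHost, freeSlots);
    }

    public String[] getFreeSlots(String jobId, int numSlots) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : free.entrySet())
            for (int i = 0; i < e.getValue() && out.size() < numSlots; i++)
                out.add(e.getKey());
        return out.toArray(new String[0]);
    }
}
```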
> h4. Glossary
> * JobScheduler - The global, task-scheduler which is today's JobTracker
> minus the code for tracking/monitoring jobs and their tasks. A pure scheduler.
> * JobManager - The per-job manager which is wholly responsible for working
> with the JobScheduler and TaskTrackers to schedule its tasks and track their
> progress till job-completion (success/failure). Simplistically, it is the
> current JobClient plus the enhancements to enable it to talk to the
> JobScheduler and TaskTrackers for running/monitoring the tasks.
> ----
> h3. Tickets for the Gravy-Train ride
> Eric has started a discussion about generalizing Hadoop to support non-MR
> tasks, a discussion which has surfaced a few times on our lists, at
> HADOOP-2491.
> He notes:
> {quote}
> Our primary goal in going this way would be to get better utilization out of
> map-reduce clusters and support a richer scheduling model. The ability to
> support alternative job frameworks would just be gravy!
> Putting this in as a place holder. Hope to get folks talking about this to
> post some more detail.
> {quote}
> This is the start of the path to the promised gravy-land. *smile*
> We believe Map-Reduce 2.0 is a good start in moving most (if not all) of the
> Map-Reduce specific code into the user-clients (i.e. JobManager) and taking a
> shot at generalizing the JobTracker (as the JobScheduler) and the TaskTracker
> to handle more generic tasks via different (smarter/dumber) user-clients.
> ----
> Thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.