[ 
https://issues.apache.org/jira/browse/MAPREDUCE-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757233#action_12757233
 ] 

Hong Tang commented on MAPREDUCE-728:
-------------------------------------

The attached patch is a first step that lays the groundwork for many follow-up 
improvements. Below are replies to earlier comments on this Jira.

bq. I have one item of high-level feedback. It looks like Mumak has two 
components - a simulator and a trace-driven workload generator. It would be 
nice if the workload generator was pluggable so that the simulator could be 
used on synthetic workloads without requiring a trace. For example, one should 
be able to create a simulated cluster where some given node is always slow, or 
fails partway through, etc. Then the simulator could be used in unit tests, 
simplifying a lot of the testing code in various schedulers.

This is very close to what the patch does (after MAPREDUCE-966). The 
dependency between Mumak and Rumen (the load generator) comes down to a small 
set of interfaces: JobStory, JobStoryProducer, and ClusterStory. Currently 
JobStoryProducer maps to SimulatorJobStory, and ClusterStory maps to 
ZombieCluster. It should be very easy to make them pluggable, and I will create 
a Jira to track this.
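
As a rough illustration of what a pluggable, trace-free workload source could 
look like, here is a minimal sketch. All names in it (SyntheticJob, 
WorkloadSource, FixedRateWorkload, WorkloadDemo) are hypothetical and are not 
the Rumen JobStory/JobStoryProducer API, whose methods are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical job description; NOT the Rumen JobStory interface.
final class SyntheticJob {
    final String name;
    final long submitTimeMs;
    final int numMaps;
    final int numReduces;

    SyntheticJob(String name, long submitTimeMs, int numMaps, int numReduces) {
        this.name = name;
        this.submitTimeMs = submitTimeMs;
        this.numMaps = numMaps;
        this.numReduces = numReduces;
    }
}

// Hypothetical plug-in point: the simulator pulls jobs until null is returned.
interface WorkloadSource {
    SyntheticJob nextJob();
}

// Synthetic source that emits identical jobs at a fixed interval, so no
// recorded trace is needed.
final class FixedRateWorkload implements WorkloadSource {
    private final int totalJobs;
    private final long intervalMs;
    private final int maps;
    private final int reduces;
    private int emitted = 0;

    FixedRateWorkload(int totalJobs, long intervalMs, int maps, int reduces) {
        this.totalJobs = totalJobs;
        this.intervalMs = intervalMs;
        this.maps = maps;
        this.reduces = reduces;
    }

    @Override
    public SyntheticJob nextJob() {
        if (emitted >= totalJobs) {
            return null; // workload exhausted
        }
        SyntheticJob job = new SyntheticJob(
            "synthetic-" + emitted, emitted * intervalMs, maps, reduces);
        emitted++;
        return job;
    }
}

final class WorkloadDemo {
    // Drains any source; a trace-backed source could be swapped in unchanged.
    static List<SyntheticJob> drain(WorkloadSource source) {
        List<SyntheticJob> jobs = new ArrayList<>();
        for (SyntheticJob j = source.nextJob(); j != null; j = source.nextJob()) {
            jobs.add(j);
        }
        return jobs;
    }

    public static void main(String[] args) {
        for (SyntheticJob j : drain(new FixedRateWorkload(3, 1000L, 4, 1))) {
            System.out.println(j.name + " submitted at " + j.submitTimeMs + " ms");
        }
    }
}
```

A unit test for a scheduler could then supply a small hand-written source of 
this kind instead of loading a trace file.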

bq. Then the simulator could be used in unit tests, simplifying a lot of the 
testing code in various schedulers.
Yes, the included unit tests were written this way, and they exposed two 
possible bugs in the recent JobHistory changes made by MAPREDUCE-157 
(MAPREDUCE-995 and MAPREDUCE-1000). This was a pleasant surprise to me: we had 
been thinking of using Mumak for design proofs or performance validation, but 
our design choice to use the actual scheduler code and JT code also makes it a 
JT debugger.

bq. What will be done about speculative tasks? Will Mumak simulate 
high-memory jobs? 
Neither is done in this patch, but I agree these are things we should simulate 
in follow-up improvements.

bq. The schedulers and the JobTracker currently have some threads that perform 
an operation periodically and sleep in-between doing so. 
We (partially) solve the problem by using AspectJ so that the threads become 
no-ops. The solution is only partial because the threads are still active, and 
because the interception logic cannot be determined mechanically, although in 
most cases the points to intercept are very straightforward to identify.
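
To illustrate why sleeping threads do not fit a simulator, here is a minimal 
sketch of a virtual-time event loop in which a periodic operation is 
rescheduled as an event instead of running in a thread that sleeps. This is a 
generic discrete-event pattern, not the code in the patch, and every name in it 
(VirtualClockLoop, schedulePeriodic, PeriodicDemo) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical minimal discrete-event loop; not Mumak's actual engine.
final class VirtualClockLoop {
    // An event is a virtual timestamp plus the action to run at that time.
    private static final class Event implements Comparable<Event> {
        final long time;
        final long seq; // tie-breaker so same-time events keep insertion order
        final Runnable action;

        Event(long time, long seq, Runnable action) {
            this.time = time;
            this.seq = seq;
            this.action = action;
        }

        @Override
        public int compareTo(Event other) {
            int byTime = Long.compare(time, other.time);
            return byTime != 0 ? byTime : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();
    private long now = 0;
    private long nextSeq = 0;

    long now() {
        return now;
    }

    void scheduleAt(long time, Runnable action) {
        queue.add(new Event(time, nextSeq++, action));
    }

    // Stands in for a "while (true) { work(); sleep(period); }" thread:
    // each run of the action schedules the next one in virtual time.
    void schedulePeriodic(long firstTime, long period, long endTime, Runnable action) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        if (firstTime > endTime) {
            return;
        }
        scheduleAt(firstTime, () -> {
            action.run();
            schedulePeriodic(firstTime + period, period, endTime, action);
        });
    }

    // Processes events in timestamp order; the clock jumps, nothing sleeps.
    void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            now = e.time;
            e.action.run();
        }
    }
}

final class PeriodicDemo {
    static List<Long> heartbeatTimes(long period, long endTime) {
        VirtualClockLoop loop = new VirtualClockLoop();
        List<Long> seen = new ArrayList<>();
        loop.schedulePeriodic(0, period, endTime, () -> seen.add(loop.now()));
        loop.run();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatTimes(3000L, 10000L)); // [0, 3000, 6000, 9000]
    }
}
```

In a loop like this, ten virtual seconds of periodic work complete without any 
wall-clock waiting, which is the property a sleeping thread breaks.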

> Mumak: Map-Reduce Simulator
> ---------------------------
>
>                 Key: MAPREDUCE-728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-728
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.21.0
>            Reporter: Arun C Murthy
>            Assignee: Hong Tang
>             Fix For: 0.21.0
>
>         Attachments: 19-jobs.topology.json.gz, 19-jobs.trace.json.gz, 
> mapreduce-728-20090917-3.patch, mapreduce-728-20090917-4.patch, 
> mapreduce-728-20090917.patch, mumak.png
>
>
> h3. Vision:
> We want to build a Simulator to simulate large-scale Hadoop clusters, 
> applications and workloads. This would be invaluable in furthering Hadoop by 
> providing a tool for researchers and developers to prototype features (e.g. 
> pluggable block-placement for HDFS, Map-Reduce schedulers etc.) and predict 
> their behaviour and performance with a reasonable amount of confidence, 
> thereby aiding rapid innovation.
> ----
> h3. First Cut: Simulator for the Map-Reduce Scheduler
> The Map-Reduce Scheduler is a fertile area of interest with at least four 
> schedulers, each with their own set of features, currently in existence: 
> Default Scheduler, Capacity Scheduler, Fairshare Scheduler & Priority 
> Scheduler.
> Each scheduler's scheduling decisions are driven by many factors, such as 
> fairness, capacity guarantee, resource availability, data-locality etc.
> Given that, it is non-trivial to choose the right scheduler, or even the 
> right set of scheduler features, for a given workload. Hence a simulator 
> which can predict how well a particular scheduler works for some specific 
> workload by quickly iterating over schedulers and/or scheduler features 
> would be quite useful.
> So, the first cut is to implement a simulator for the Map-Reduce scheduler 
> which takes as input a job trace derived from a production workload and a 
> cluster definition, and simulates the execution of the jobs as defined in 
> the trace in this virtual cluster. As output, the detailed job execution 
> trace (recorded in relation to virtual simulated time) could then be analyzed 
> to understand various traits of individual schedulers (individual job 
> turnaround time, throughput, fairness, capacity guarantee, etc.). To support 
> this, we would need a simulator which could accurately model the conditions 
> of the actual system which would affect a scheduler's decisions. These include 
> very large-scale clusters (thousands of nodes), the detailed characteristics 
> of the workload thrown at the clusters, job or task failures, data locality, 
> and cluster hardware (cpu, memory, disk i/o, network i/o, network topology) 
> etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
