Sounds like this paper might help you: "Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning" by Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David Patterson
http://radlab.cs.berkeley.edu/publication/187

On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> Some additional thoughts about the 'variables' involved in characterizing
> the M/R application itself:
>
>    - the configuration of the cluster for numbers of mappers vs. reducers
>    compared to the characteristics (amount of work/processing) required in
>    each of the map/shuffle/reduce stages
>
>    - is the application using multiple chained M/R stages? Multi-stage
>    M/R jobs are more difficult to tune properly in terms of keeping all
>    workers busy. That may be challenging to model.
>
> 2011/4/16 Stephen Boesch <java...@gmail.com>
>
> > You could consider two scenarios / sets of requirements for your estimator:
> >
> >    1. Allow it to 'learn' from certain input data and then project running
> >    times of similar (or moderately dissimilar) workloads. So the first steps
> >    could be to define a couple of relatively small "control" M/R jobs on a
> >    small-ish dataset and throw them at the unknown (cluster-under-test)
> >    HDFS/M/R cluster. Try to design the "control" M/R job in a way that it
> >    will be able to completely load down all of the available DataNodes in
> >    the cluster-under-test for at least a brief period of time. Then you will
> >    have obtained a decent signal on the capabilities of the cluster under
> >    test, which may allow a relatively high degree of predictive accuracy
> >    even for much larger jobs.
> >    2. If instead your goal is to drive the predictions off a purely
> >    mathematical model - in your terms, the "application" and "base file
> >    system" - without any empirical data, then here is an alternative
> >    approach:
> >       - Follow step (1) above against a variety of "applications" and
> >       "base file systems" - especially in configurations for which you wish
> >       your model to provide high-quality predictions.
> >       - Save the results as structured data.
> >       - Derive formulas characterizing the performance curves in terms of
> >       the variables you defined (application / base file system).
> >
> > Now you have a trained model. When it is applied to a new set of
> > applications / base file systems, it can use the curves you have already
> > determined to provide the result without any runtime requirements.
> >
> > Obviously the value of this second approach is limited by the degree of
> > similarity of the training data to the applications you attempt to model.
> > If all of your training data is from a 50-node cluster of machines with
> > IDE drives, don't expect good results when asked to model a 1000-node
> > cluster using SANs / RAID arrays / SCSI drives.
> >
> >
> > 2011/4/16 Sonal Goyal <sonalgoy...@gmail.com>
> >
> >> What is your MR job doing? What is the amount of data it is processing?
> >> What kind of a cluster do you have? Would you be able to share some
> >> details about what you are trying to do?
> >>
> >> If you are looking for metrics, you could look at the Terasort run.
> >>
> >> Thanks and Regards,
> >> Sonal
> >> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
> >> Nube Technologies <http://www.nubetech.co>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >>
> >>
> >> On Sat, Apr 16, 2011 at 3:31 PM, real great..
> >> <greatness.hardn...@gmail.com> wrote:
> >>
> >> > Hi,
> >> > As a part of my final-year BE project I want to estimate the time
> >> > required by an M/R job, given an application and a base file system.
> >> > Can you folks please help me by posting some thoughts on this issue
> >> > or some links here.
> >> >
> >> > --
> >> > Regards,
> >> > R.V.
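
To make the "trained model" idea from Stephen's second scenario above concrete, here is a minimal sketch of the curve-fitting step, assuming you have already collected (input size, cluster size, measured runtime) triples from small control runs. The feature choice and every number below are made-up placeholders for illustration, not measurements from a real cluster:

```python
# Minimal sketch: fit a simple runtime curve from control-run measurements.
# All numbers are made-up placeholders; real rows would come from your own
# control M/R runs on the cluster under test.
import numpy as np

# (input_gb, nodes, measured_runtime_seconds) from small "control" jobs
runs = np.array([
    [ 1.0,  4,   95.0],
    [ 5.0,  4,  380.0],
    [10.0,  8,  410.0],
    [20.0,  8,  790.0],
    [40.0, 16,  820.0],
])
input_gb, nodes, runtime = runs[:, 0], runs[:, 1], runs[:, 2]

# Assumed model: runtime ~= a * (input_gb / nodes) + b * input_gb + c
# i.e. a per-node processing term, a per-GB term (e.g. shuffle/IO), and a
# fixed job-overhead term. Least squares gives the coefficients a, b, c.
features = np.column_stack([input_gb / nodes, input_gb, np.ones_like(input_gb)])
coeffs, _, _, _ = np.linalg.lstsq(features, runtime, rcond=None)

def predict_runtime(gb, n):
    """Predicted runtime (seconds) for a job over `gb` gigabytes on `n` nodes."""
    return float(coeffs @ np.array([gb / n, gb, 1.0]))

# Extrapolate to a larger job -- only trustworthy if the target cluster and
# application resemble the control runs.
print(predict_runtime(100.0, 16))
```

In practice you would add more features (record counts, mapper/reducer slots, number of chained M/R stages, storage characteristics) and validate the fit against held-out jobs before trusting any extrapolation; as Stephen notes, the predictions are only as good as the similarity between the training runs and the configuration you are trying to model.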