Sounds like this paper might help you:

Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning, by Archana Ganapathi, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David
Patterson

http://radlab.cs.berkeley.edu/publication/187

On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> Some additional thoughts about the 'variables' involved in
> characterizing the M/R application itself.
>
>
>   - the configuration of the cluster for numbers of mappers vs reducers
>   compared to the characteristics (amount of work/processing) required in each
>   of the map/shuffle/reduce stages
>
>
>   - is the application using multiple chained M/R stages?  Multi-stage
>   M/R jobs are more difficult to tune properly in terms of keeping all workers
>   busy. That may be challenging to model.
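
A rough sketch of how those two variables (task slots per stage and chained
stages) might enter a first-cut estimate. It assumes tasks run in full
"waves" across the available slots and that stages run strictly one after
another; it ignores shuffle overlap, stragglers and data skew. The function
names and all numbers are made up for illustration, not taken from Hadoop.

```python
import math

def stage_time(num_tasks, slots, secs_per_task):
    """Estimate one stage: tasks run in waves across the available slots."""
    waves = math.ceil(num_tasks / slots)
    return waves * secs_per_task

def job_time(stages):
    """Sum the stage estimates for a chain of M/R stages run back to back."""
    return sum(stage_time(*stage) for stage in stages)

# Hypothetical two-job chain: (num_tasks, slots, secs_per_task) per stage.
chain = [
    (100, 40, 30.0),  # job 1 map: 3 waves, last wave only half full
    (10, 20, 60.0),   # job 1 reduce: 1 wave, half the reduce slots idle
    (50, 40, 20.0),   # job 2 map: 2 waves
    (20, 20, 45.0),   # job 2 reduce: 1 wave, all slots busy
]

print(job_time(chain))
```

Even this toy version shows the tuning problem: a stage with fewer tasks
than slots leaves workers idle, and a partly filled last wave costs as much
as a full one.
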
>
> 2011/4/16 Stephen Boesch <java...@gmail.com>
>
> > You could consider two scenarios / sets of requirements for your estimator:
> >
> >
> >    1. Allow it to 'learn' from certain input data and then project running
> >    times of similar (or moderately dissimilar) workloads. So the first
> >    steps could be to define a couple of relatively small "control" M/R
> >    jobs on a small-ish dataset and throw them at the unknown
> >    (cluster-under-test) HDFS/M/R cluster. Try to design the "control"
> >    M/R jobs in a way that they will be able to completely load down all
> >    of the available DataNodes in the cluster-under-test for at least a
> >    brief period of time. Then you will have obtained a decent signal on
> >    the capabilities of the cluster under test, which may allow a
> >    relatively high degree of predictive accuracy for even much larger
> >    jobs.
> >    2. If instead it were your goal to drive the predictions off of a
> >    purely mathematical model - in your terms the "application" and "base
> >    file system" - and without any empirical data - then here is an
> >    alternative approach.
> >       - Follow step (1) above against a variety of "applications" and
> >       "base file systems" - especially in configurations for which you
> >       wish your model to provide high quality predictions.
> >       - Save the results in structured data
> >       - Derive formulas for characterizing the curves of performance via
> >       those variables that you defined (application / base file system)
> >
> > Now you have a trained model.  When it is applied to a new set of
> > applications / base file systems it can use the curves you have already
> > determined to provide the result without any runtime requirements.
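
The "save the results, then derive formulas" step could be sketched roughly
as below. This assumes the simplest possible curve, a straight line of
runtime against input size for one application on one cluster, fitted by
ordinary least squares. The helper names are hypothetical and the
control-run numbers are invented for illustration, not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, input_gb):
    """Apply a fitted (slope, intercept) model to a new input size."""
    slope, intercept = model
    return slope * input_gb + intercept

# Invented control runs: (input size in GB, runtime in seconds).
control_runs = [(1, 42.0), (2, 54.0), (4, 78.0), (8, 126.0)]
sizes = [gb for gb, _ in control_runs]
times = [secs for _, secs in control_runs]

model = fit_line(sizes, times)
print(model)               # per-GB cost and fixed overhead
print(predict(model, 64))  # projected runtime for a larger job
```

In practice you would fit one such curve per combination of application and
base file system, and probably need more variables than input size alone.
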
> >
> > Obviously the value of this second approach is limited by the degree of
> > similarity of the training data to the applications you attempt to model.
> > If all of your training data is from a 50-node cluster of machines with
> > IDE drives, don't expect good results when asked to model a 1000-node cluster
> > using SAN / RAID / SCSI storage.
> >
> >
> > 2011/4/16 Sonal Goyal <sonalgoy...@gmail.com>
> >
> >> What is your MR job doing? What is the amount of data it is processing?
> >> What kind of a cluster do you have? Would you be able to share some
> >> details about what you are trying to do?
> >>
> >> If you are looking for metrics, you could look at the Terasort run.
> >>
> >> Thanks and Regards,
> >> Sonal
> >> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
> >> Nube Technologies <http://www.nubetech.co>
> >>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >>
> >> On Sat, Apr 16, 2011 at 3:31 PM, real great..
> >> <greatness.hardn...@gmail.com> wrote:
> >>
> >> > Hi,
> >> > As part of my final-year BE project I want to estimate the time
> >> > required by an M/R job given an application and a base file system.
> >> > Can you folks please help me by posting some thoughts on this issue or
> >> > some links here?
> >> >
> >> > --
> >> > Regards,
> >> > R.V.
> >> >
> >>
> >
> >
