If you want to do profiling on your Hadoop cluster, the Starfish project
might be interesting. You can find more info at
http://www.cs.duke.edu/starfish/

Thanks,
Shumin
 On Feb 25, 2014 3:31 PM, "Thomas Bentsen" <t...@bentzn.com> wrote:

> Thanks a lot guys!
> From Dieter's original reply I got TeraSort and I am currently running
> different scenarios with that. It seems to be The Benchmark right now.
> It's relatively simple and yet it does test most of the functionality.
>
> Devin: You mention a couple of books I already have in the stack for
> reading. Do any of you know of an authoritative source on actual
> optimization (maybe even 'profiling'?) of a Hadoop cluster?
> I am testing on relatively (very) light HW and my background is Java
> servers so I started fiddling with mem-settings - of course. Not much
> luck there. :-D
>
>
> /th
>
>
>
>
>
> On Tue, 2014-02-25 at 15:43 -0500, Devin Suiter RDX wrote:
> > http://sortbenchmark.org/
> >
> >
> > Doesn't just cover Hadoop, but maybe the methodology will give you an
> > idea of what you're looking for.
> >
> >
> > There are too many variables to pin down a "general" average. Every job
> > will run differently on every cluster, given the machines can be
> > heterogeneous builds, with heterogeneous configs at the machine level,
> > then the cluster will have configs that may or may not override the
> > machine configs...plus the job submitter can specify runtime
> > variables...
> >
> >
> > Things like the type of data being processed affect the amount of disk
> > I/O, network traffic required, etc., which are in turn affected by
> > their components...
> >
> >
> > Throwing more nodes at a problem will usually make it faster, but how
> > much faster depends...
> >
> >
> > The best way to read your cluster is to establish a benchmark operation that
> > models your expected use case (or one of them), then adjust things on
> > the cluster and see what tips the time, spill, network traffic, etc.
> > one way or another.
> >
> >
> > Eric Sammer's Hadoop Operations will break down nicely how real-life
> > cluster configs affect performance. There are also a lot of case
> > studies in Tom White's Hadoop: The Definitive Guide.
> >
> > Devin Suiter
> > Jr. Data Solutions Software Engineer
> >
> > 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> > Google Voice: 412-256-8556 | www.rdx.com
> >
> >
> > On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin
> > <bstem...@rightaction.com> wrote:
> >         Part of the problem is the word, "process."  That could be
> >         really complicated or really easy.  It could also be done in
> >         Java or some other language via the streaming JAR.
> >
> >
> >         It's hard for anyone to say without more details.  Even with
> >         more details, it's still pretty hard to say.
> >
> >
> >         Brian
> >
> >
> >         On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen
> >         <t...@bentzn.com> wrote:
> >                 Thanks Dieter!
> >                 I'll look into it.
> >
> >                 Still... It would be nice to hear something from the
> >                 real world. Would any of you working with Hadoop in a
> >                 prod env be willing to share something?
> >
> >                 /th
> >
> >
> >
> >
> >                 On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte
> >                 wrote:
> >                 > Hi,
> >                 >
> >                 > The TeraSort benchmark is probably the most common.
> >                 > It has mappers and reducers doing 'nothing', so you
> >                 > only use the framework's mergesort functionalities.
> >                 >
> >                 >
> >                 > Regards, Dieter
> >                 >
> >                 >
> >                 >
> >                 > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen <t...@bentzn.com>:
> >                 >         Hi everyone
> >                 >
> >                 >         I am still beginning Hadoop.
> >                 >         Are there any benchmarks or 'performance
> >                 >         heuristics' for Hadoop?
> >                 >         Is it possible to say something like 'You can
> >                 >         process X lines of GZipped log file on a medium
> >                 >         AWS server in Y minutes'? I would like to get an
> >                 >         idea of what kind of workflow is possible.
> >                 >
> >                 >         Thanks in advance
> >                 >
> >                 >         Thomas Bentsen
> >                 >
> >                 >
> >                 >
> >
> >
> >
> >
> >
> >
> >
>
>
>
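Devin's advice in the thread (fix one benchmark operation, change one
thing on the cluster, see which way the time tips) could be scripted
roughly as below. This is only a sketch under assumptions: the helper
names are made up for illustration, the examples jar path and HDFS
directories are placeholders, and the property name
mapreduce.job.reduces applies to Hadoop 2 (older releases use a
different name). It measures wall-clock time only; spill and network
traffic would have to be read from the job counters separately.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def compare(variants, runs=3):
    """Map each variant name to the median wall-clock time of its command."""
    return {name: statistics.median(time_command(cmd, runs))
            for name, cmd in variants.items()}


def terasort_cmd(jar, in_dir, out_dir, overrides=None):
    """Build a `hadoop jar <jar> terasort` command line.

    `overrides` is a dict of job properties passed as -Dkey=value.
    The jar path and directories are whatever your installation uses.
    """
    cmd = ["hadoop", "jar", jar, "terasort"]
    for key, value in (overrides or {}).items():
        cmd.append(f"-D{key}={value}")
    return cmd + [in_dir, out_dir]


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without a cluster; swap in
    # terasort_cmd(...) variants when a real cluster is available.
    demo = {
        "baseline": [sys.executable, "-c", "pass"],
        "variant": [sys.executable, "-c", "import time; time.sleep(0.1)"],
    }
    for name, seconds in compare(demo, runs=2).items():
        print(f"{name}: {seconds:.3f}s")
```

On a real cluster each variant would be something like
terasort_cmd(jar, "/bench/in", "/bench/out-4", {"mapreduce.job.reduces": 4}),
with the input generated once beforehand by teragen. TeraSort refuses to
write into an existing output directory, so each run needs a fresh one,
which also means runs=1 per variant unless the directory is cleaned up
between runs.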
