If you want to profile your Hadoop cluster, the Starfish project might be interesting. You can find more info at http://www.cs.duke.edu/starfish/

Thanks,
Shumin

On Feb 25, 2014 3:31 PM, "Thomas Bentsen" <t...@bentzn.com> wrote:

> Thanks a lot guys!
> From Dieter's original reply I got TeraSort and I am currently running
> different scenarios with that. It seems to be The Benchmark right now.
> It's relatively simple and yet it does test most of the functionality.
>
> Devin: You mention a couple of books I already have in the stack for
> reading. Do any of you know of an authoritative source on actual
> optimization (maybe even 'profiling'?) of a Hadoop cluster?
> I am testing on relatively (very) light HW and my background is Java
> servers so I started fiddling with mem-settings - of course. Not much
> luck there. :-D
>
> /th
>
> On Tue, 2014-02-25 at 15:43 -0500, Devin Suiter RDX wrote:
> > http://sortbenchmark.org/
> >
> > Doesn't just cover Hadoop, but maybe the methodology will give you an
> > idea of what you're looking for.
> >
> > There are too many variables to pin down a "general" average. Every job
> > will run differently on every cluster, given the machines can be
> > heterogeneous builds, with heterogeneous configs at the machine level,
> > then the cluster will have configs that may or may not override the
> > machine configs...plus the job submitter can specify runtime
> > variables...
> >
> > Things like the type of data being processed affect the amount of disk
> > I/O, network traffic required, etc., which are in turn affected by
> > their components...
> >
> > Throwing more nodes at a problem will usually make it faster, but how
> > much faster depends...
> >
> > Best way to read your cluster is to establish a benchmark operation that
> > models your expected use case (or one of them), then adjust things on
> > the cluster and see what tips the time, spill, network traffic, etc.
> > one way or another.
> >
> > Eric Sammer's Hadoop Operations will break down nicely how real-life
> > cluster configs affect performance. There are also a lot of case
> > studies in Tom White's Hadoop: The Definitive Guide.
> >
> > Devin Suiter
> > Jr. Data Solutions Software Engineer
> >
> > 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> > Google Voice: 412-256-8556 | www.rdx.com
> >
> > On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin
> > <bstem...@rightaction.com> wrote:
> >         Part of the problem is the word, "process." That could be
> >         really complicated or really easy. It could also be done in
> >         Java or some other language via the streaming JAR.
> >
> >         It's hard for anyone to say without more details. Even with
> >         more details, it's still pretty hard to say.
> >
> >         Brian
> >
> >         On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen
> >         <t...@bentzn.com> wrote:
> >                 Thanks Dieter!
> >                 I'll look into it.
> >
> >                 Still... It would be nice to hear something from the
> >                 real world. Would any of you working with Hadoop in a
> >                 prod env be willing to share something?
> >
> >                 /th
> >
> >                 On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte
> >                 wrote:
> >                 > Hi,
> >                 >
> >                 > The terasort benchmark is probably the most common.
> >                 > It has mappers and reducers doing 'nothing', this way
> >                 > you only use the framework's mergesort functionalities.
> >                 >
> >                 > Regards, Dieter
> >                 >
> >                 > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen
> >                 > <t...@bentzn.com>:
> >                 >         Hi everyone
> >                 >
> >                 >         I am still beginning Hadoop.
> >                 >         Are there any benchmarks or 'performance
> >                 >         heuristics' for Hadoop?
> >                 >         Is it possible to say something like 'You
> >                 >         can process X lines of GZipped log file on a
> >                 >         medium AWS server in Y minutes'? I would like
> >                 >         to get an idea of what kind of workflow is
> >                 >         possible.
> >                 >
> >                 >         Thanks in advance
> >                 >
> >                 >         Thomas Bentsen
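
Devin's advice quoted above (pick one benchmark operation, change one thing on the cluster, see what tips the time) could be sketched roughly as below. This is only a minimal sketch under assumptions: the helper names time_command, summarize and compare are made up for illustration, and the placeholder commands just start a Python interpreter. On a real cluster you would presumably substitute your own TeraSort invocation (something along the lines of `hadoop jar <examples jar> terasort <in> <out>`, where the jar path and directories depend on your install). It measures wall-clock time only; spill and network traffic would have to come from the job counters.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the benchmark command itself fails.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to run count, median, min and max."""
    return {
        "runs": len(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def compare(scenarios, runs=3):
    """scenarios maps a label to a command (list of args); returns label -> summary."""
    return {label: summarize(time_command(cmd, runs))
            for label, cmd in scenarios.items()}


if __name__ == "__main__":
    # Placeholder commands (assumption): replace each with the same benchmark
    # job submitted under a different cluster or job setting.
    scenarios = {
        "baseline": [sys.executable, "-c", "pass"],
        "changed-setting": [sys.executable, "-c", "sum(range(100000))"],
    }
    for label, stats in compare(scenarios).items():
        print(label, stats)
```

Repeating each scenario a few times and comparing medians is meant to keep a single noisy run from deciding the outcome; changing only one setting per scenario keeps the comparison readable.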