Re: recommendation on HDDs

2011-02-18 Thread Ted Dunning
Better to provide a summary as well as the link. On Friday, February 18, 2011, Shrinivas Joshi wrote: > There seems to be a wiki page already intended for capturing information on > disks in a Hadoop environment. http://wiki.apache.org/hadoop/DiskSetup > > Do we just want to link the thread on HDD …

Re: recommendation on HDDs

2011-02-18 Thread Shrinivas Joshi
There seems to be a wiki page already intended for capturing information on disks in a Hadoop environment. http://wiki.apache.org/hadoop/DiskSetup Do we just want to link the thread on HDD recommendations from this wiki page? -Shrinivas On Tue, Feb 15, 2011 at 11:48 AM, zGreenfelder wrote: > unto …

Re: benchmark choices

2011-02-18 Thread Konstantin Boudnik
On Fri, Feb 18, 2011 at 14:35, Ted Dunning wrote: > I just read the MalStone report. They report times for a Java version that > is many times (5x) slower than a streaming implementation. That single > fact indicates that the Java code is so appallingly bad that this is a very > bad benchmark. …

Re: benchmark choices

2011-02-18 Thread Ted Dunning
I just read the MalStone report. They report times for a Java version that is many times (5x) slower than a streaming implementation. That single fact indicates that the Java code is so appallingly bad that this is a very bad benchmark. On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout wrote: > …

Re: benchmark choices

2011-02-18 Thread Shrinivas Joshi
Thanks, Jim. MRBench, mentioned in this paper (http://dcslab.snu.ac.kr/~khjeon/papers/2008/icpads_mrbench.pdf), looks like a map/reduce port of the TPC-H workload. BTW, the MRBench mentioned in the above paper and the one in mapred/src/test/mapred/org/apache/hadoop/mapred/MRBench.java look different to me. Is t…

RE: benchmark choices

2011-02-18 Thread Jim Falgout
We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the data and the queries, if not the query generator. There is a Jira issue in Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I don't remember the issue number offhand. -Original Message- From: S…

Re: How to package multiple jars for a Hadoop job

2011-02-18 Thread Mark Kerzner
Thank you, Mark On Fri, Feb 18, 2011 at 4:23 PM, Eric Sammer wrote: > Mark: > > You have a few options. You can: > > 1. Package dependent jars in a lib/ directory of the jar file. > 2. Use something like Maven's assembly plugin to build a self-contained > jar. > > Either way, I'd strongly recommend …

Re: How to package multiple jars for a Hadoop job

2011-02-18 Thread Eric Sammer
Mark: You have a few options. You can: 1. Package dependent jars in a lib/ directory of the jar file. 2. Use something like Maven's assembly plugin to build a self-contained jar. Either way, I'd strongly recommend using something like Maven to build your artifacts so they're reproducible and in…
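Eric's first option (dependent jars under lib/ inside the job jar) can be sketched with a short script using Python's standard zipfile module. This is a hypothetical build step for illustration, not Mark's actual script; the function name and paths are made up:

```python
import zipfile
from pathlib import Path

def build_job_jar(main_jar: str, dep_jars: list, out_jar: str) -> None:
    """Repackage a job jar: copy the main jar's entries as-is, then add
    each dependency jar under lib/ inside the output jar."""
    with zipfile.ZipFile(out_jar, "w") as out:
        with zipfile.ZipFile(main_jar) as main:
            for info in main.infolist():
                # Preserve each original entry (classes, META-INF, etc.)
                out.writestr(info, main.read(info.filename))
        for dep in dep_jars:
            # Hadoop adds everything under lib/ in the job jar to the
            # task classpath, which is what makes option 1 work.
            out.write(dep, arcname="lib/" + Path(dep).name)
```

A jar is just a zip file, so no Java tooling is needed for this step; the result can be submitted with `hadoop jar job.jar ...` as usual.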

How to package multiple jars for a Hadoop job

2011-02-18 Thread Mark Kerzner
Hi, I have a script that I use to re-package all the jars (which NetBeans outputs into a dist directory), and it structures everything correctly into a single jar for running a MapReduce job. Here it is below, but I am not sure if it is the best practice. Besides, it hard-codes my paths. I am …

Re: Quick question

2011-02-18 Thread maha
Thanks Ted and Jim :) Maha On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote: > That's right. The TextInputFormat handles situations where records cross > split boundaries. What your mapper will see is "whole" records. > > -Original Message- > From: maha [mailto:m...@umail.ucsb.edu] > S…

Re: benchmark choices

2011-02-18 Thread Ted Dunning
MalStone looks like a very narrow benchmark. TeraSort is also a very narrow and somewhat idiosyncratic benchmark, but it has the characteristic that lots of people use it. You should add PigMix to your list. There are Java versions of the problems in PigMix that make a pretty good set of benchmarks …

benchmark choices

2011-02-18 Thread Shrinivas Joshi
Which workloads are used for serious benchmarking of Hadoop clusters? Do you care about any of the following workloads: TeraSort; GridMix v1, v2, or v3; MalStone; CloudBurst; MRBench; NNBench; or the sample apps shipped with the Hadoop distro, like PiEstimator, dbcount, etc.? Thanks, -Shrinivas
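Of the workloads listed above, TeraSort is typically driven as a teragen / terasort / teravalidate sequence against the examples jar. A minimal Python driver might look like the sketch below; the jar filename, directory layout, and helper name are assumptions to adjust for your distro:

```python
import subprocess  # used in the commented run loop below

def terasort_cmds(examples_jar, rows, base):
    """Build the three hadoop commands for a TeraSort run:
    generate `rows` records, sort them, then validate the sort."""
    return [
        ["hadoop", "jar", examples_jar, "teragen", str(rows), base + "/input"],
        ["hadoop", "jar", examples_jar, "terasort", base + "/input", base + "/sorted"],
        ["hadoop", "jar", examples_jar, "teravalidate", base + "/sorted", base + "/report"],
    ]

# On a real cluster you would time the middle (terasort) step, e.g.:
#   for cmd in terasort_cmds("hadoop-examples.jar", 10_000_000, "/bench"):
#       subprocess.run(cmd, check=True)
```

Keeping the command construction in one place makes it easy to script repeated runs at different row counts when comparing cluster configurations.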

RE: Quick question

2011-02-18 Thread Jim Falgout
That's right. The TextInputFormat handles situations where records cross split boundaries. What your mapper will see is "whole" records. -Original Message- From: maha [mailto:m...@umail.ucsb.edu] Sent: Friday, February 18, 2011 1:14 PM To: common-user Subject: Quick question Hi all, …

Re: Quick question

2011-02-18 Thread Ted Dunning
The input is effectively split by lines, but under the covers, the actual splits are by byte. Each mapper will cleverly scan from the specified start to the next line after the start point. At the end, it will over-read to the end of the line that is at or after the end of its specified region. This…
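The rule Ted describes can be sketched in a few lines of Python. This is a simplified model for illustration, not Hadoop's actual LineRecordReader, and the function name is made up:

```python
def read_split(data: bytes, start: int, end: int):
    """Yield the complete lines 'owned' by the byte range [start, end).

    A mapper whose split does not begin at byte 0 skips the partial line
    it lands in, because the previous mapper over-reads past its own end
    to finish that line.
    """
    pos = start
    if start != 0:
        # Skip to just past the next newline: the previous split's
        # reader is responsible for the line we landed inside.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]  # last line, no trailing newline
            return
        yield data[pos:nl]    # may over-read past `end` to finish a line
        pos = nl + 1

# Usage: arbitrary byte ranges still yield every line exactly once, whole.
data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(0, 9), (9, 17), (17, len(data))]
lines = [ln for s, e in splits for ln in read_split(data, s, e)]
```

Note that the split boundaries fall in the middle of lines, yet no mapper sees a fragment and no line is duplicated, which is exactly the "whole records" guarantee discussed in this thread.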

Quick question

2011-02-18 Thread maha
Hi all, I want to check if the following statement is right: if I use TextInputFormat to process a text file with 2000 lines (each ending with \n) with 20 mappers, then each map will get a sequence of COMPLETE LINES. In other words, the input is not split byte-wise but by lines. Is th…

RE: How do I get a JobStatus object?

2011-02-18 Thread Aaron Baff
> On Thu, Feb 17, 2011 at 12:09 AM, Aaron Baff wrote: >> I'm submitting jobs via JobClient.submitJob(JobConf), and then waiting until >> it completes with RunningJob.waitForCompletion(). I then want to get how >> long the entire MR takes, which appears to need the JobStatus since >> RunningJob…