Hey Andrew, I can name three California universities (San Diego, Caltech, Santa Barbara) that use Hadoop at small (~20TB raw) or medium scale (~800TB raw). Why not go talk to those guys?
Otherwise, you might just be able to confirm that old hardware is old. Good money says you're hard-drive limited, not network limited, anyway: 3.4MB/s triple-replicated is about 10MB/s of writes on PATA, which might approach the hardware's capability. Alternately, you can always try running on Amazon, which lets you test scaling at a very, very marginal cost.

Brian

On Apr 13, 2010, at 1:40 PM, Andrew Nguyen wrote:

> Good to know... The problem is that I'm in an academic environment that
> needs a lot of convincing regarding new computational technologies. I need
> to show proven benefit before getting the funds to actually implement
> anything. These servers were the best I could come up with for this
> proof-of-concept.
>
> I changed some settings on the nodes and have been experimenting, and I'm
> seeing about 3.4 MB/sec with TestDFSIO, which is pretty consistent with
> your observations below.
>
> Given that, would increasing the block size help my performance? This
> should result in fewer map tasks and keep the computation local
> longer...? I just need to show that the numbers are better than a single
> machine, even if sacrificing redundancy (or other factors) in the current
> setup.
>
> @alex:
>
> Thanks for the links; they give me another bit of evidence to convince
> those controlling the money flow...
>
> --Andrew
>
> On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon <t...@cloudera.com> wrote:
>> On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen <
>> andrew-lists-had...@ucsfcti.org> wrote:
>>
>>> I don't think you can :-). Sorry, they are 100Mbps NICs... I get
>>> 95 Mbit/sec from one node to another with iperf.
>>>
>>> Should I still be expecting such dismal performance with just 100Mbps?
>>
>> Yes - in my experience on gigabit, when lots of transfers are going
>> between the nodes, TCP performance actually drops to around half the
>> network capacity.
>> In the case of 100Mbps, this is probably going to be around 5 MB/sec.
>>
>> So when you're writing output at 3x replication, it's going to be very,
>> very slow on this network.
>>
>> -Todd
>>
>>> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>>>
>>>> On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen <
>>>> andrew-lists-had...@ucsfcti.org> wrote:
>>>>
>>>>> 5 identically spec'ed nodes, each has:
>>>>>
>>>>> 2 GB RAM
>>>>> Pentium 4 3.0 GHz with HT
>>>>> 250 GB HDD on PATA
>>>>> 10Mbps NIC
>>>>
>>>> This is probably your issue - a 10Mbps NIC? I didn't know you could
>>>> even get those anymore!
>>>>
>>>> Hadoop runs on commodity hardware, but you're not likely to get
>>>> reasonable performance with hardware like that.
>>>>
>>>> -Todd
>>>>
>>>>> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>>>>>
>>>>>> Andrew,
>>>>>>
>>>>>> I would also suggest running the DFSIO benchmark to isolate
>>>>>> IO-related issues:
>>>>>>
>>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>>>>>>
>>>>>> There are additional tests specific to MapReduce - run "hadoop jar
>>>>>> hadoop-0.20.2-test.jar" for the complete list.
>>>>>>
>>>>>> 45 min for mapping 6GB on 5 nodes is way too high, assuming your
>>>>>> gain/offset conversion is a simple algebraic manipulation.
>>>>>>
>>>>>> It takes less than 5 min to run a simple mapper (using streaming) on
>>>>>> a 4-node cluster on something like 10GB; the mapper I used was an awk
>>>>>> command extracting a <key:value> pair from a log (no reducer).
>>>>>>
>>>>>> Thanks
>>>>>> Alex
>>>>>>
>>>>>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> Do you need the sorting behavior that having an identity reducer
>>>>>> gives you?
>>>>>> If not, set the number of reduce tasks to 0 and you'll end up with a
>>>>>> map-only job, which should be significantly faster.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
>>>>>> andrew-lists-had...@ucsfcti.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I recently set up a 5-node cluster (1 master, 4 slaves) and am
>>>>>>> looking to use it to process high volumes of patient physiologic
>>>>>>> data. As an initial exercise to gain a better understanding, I have
>>>>>>> attempted to run the following problem (which isn't the type of
>>>>>>> problem that Hadoop was really designed for, as is my understanding).
>>>>>>>
>>>>>>> I have a 6GB data file that contains key/value pairs of <sample
>>>>>>> number, sample value>. I'd like to convert the values, based on a
>>>>>>> gain/offset, to their physical units. I've set up a MapReduce job
>>>>>>> using streaming where the mapper does the conversion and the reducer
>>>>>>> is just an identity reducer. Based on other threads on the mailing
>>>>>>> list, my initial results are consistent in that it takes considerably
>>>>>>> more time to process this in Hadoop than it does on my MacBook Pro
>>>>>>> (45 minutes vs. 13 minutes). The input is a single 6GB file, and it
>>>>>>> looks like the file is being split into 101 map tasks. This is
>>>>>>> consistent with the 64MB block size.
>>>>>>>
>>>>>>> So my questions are:
>>>>>>>
>>>>>>> * Would it help to increase the block size to 128MB? Or decrease the
>>>>>>> block size? What are some key factors to think about with this
>>>>>>> question?
>>>>>>> * Are there any other optimizations that I could employ? I have
>>>>>>> looked into LzoCompression, but I'd like to still work without
>>>>>>> compression since the single-threaded job that I'm comparing to
>>>>>>> doesn't use any sort of compression.
>>>>>>> I know I'm comparing apples to pears a little here, so please feel
>>>>>>> free to correct this assumption.
>>>>>>> * Is Hadoop really only good for jobs where the data doesn't fit on
>>>>>>> a single node? At some level, I assume that it can still speed up
>>>>>>> jobs that do fit on one node, if only because you are performing
>>>>>>> tasks in parallel.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> --Andrew
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
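For the archives: combining Todd's zero-reducer advice with Alex's awk-based streaming mapper, a map-only version of Andrew's gain/offset job might look something like the sketch below. This is not from the thread itself; the input/output paths, the gain and offset values, and the streaming jar filename are illustrative placeholders for a 0.20.2-era install.

```shell
#!/bin/sh
# Sketch of a map-only streaming job for a gain/offset conversion.
# GAIN, OFFSET, and all paths below are hypothetical examples.
GAIN=0.5
OFFSET=-2.0

# Mapper: read "<sample_number>\t<raw_value>" lines and emit
# "<sample_number>\t<raw_value * gain + offset>". awk keeps the
# example dependency-free, as in Alex's log-extraction test.
cat > convert.awk <<EOF
{ print \$1 "\t" (\$2 * $GAIN + $OFFSET) }
EOF

# Guarded so the sketch is harmless on a machine without Hadoop.
# mapred.reduce.tasks=0 skips the sort/shuffle and reduce phases
# entirely, so no identity reducer ever runs.
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input  /data/samples.tsv \
    -output /data/samples-converted \
    -mapper "awk -F'\t' -f convert.awk" \
    -file convert.awk
fi
```

With zero reduces, each mapper writes its converted split straight to HDFS, which sidesteps both the shuffle traffic and the sort Andrew doesn't need; the 3x-replicated output writes are then the main load on the 100Mbps network.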