I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95Mbit/sec from one node to another with iperf.
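For reference, the iperf measurement above amounts to something like the following (the hostname node2 is a placeholder, not from the thread):

    # On the receiving node, start iperf in server mode:
    iperf -s

    # On the sending node, run the client against the receiver (assumed
    # reachable as node2); it reports achieved bandwidth after ~10 seconds:
    iperf -c node2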
Should I still be expecting such dismal performance with just 100Mbps?

On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:

> On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen
> <andrew-lists-had...@ucsfcti.org> wrote:
>
>> 5 identically spec'ed nodes, each has:
>>
>> 2 GB RAM
>> Pentium 4 3.0GHz with HT
>> 250GB HDD on PATA
>> 10Mbps NIC
>
> This is probably your issue - a 10Mbps NIC? I didn't know you could even
> get those anymore!
>
> Hadoop runs on commodity hardware, but you're not likely to get
> reasonable performance with hardware like that.
>
> -Todd
>
>> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>>
>>> Andrew,
>>>
>>> I would also suggest running the DFSIO benchmark to isolate
>>> I/O-related issues:
>>>
>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>>>
>>> There are additional tests specific to MapReduce - run "hadoop jar
>>> hadoop-0.20.2-test.jar" for the complete list.
>>>
>>> 45 minutes for mapping 6GB on 5 nodes is way too high, assuming your
>>> gain/offset conversion is a simple algebraic manipulation.
>>>
>>> It takes less than 5 minutes to run a simple mapper (using streaming)
>>> on a 4-node cluster on something like 10GB; the mapper I used was an
>>> awk command extracting a <key:value> pair from a log (no reducer).
>>>
>>> Thanks
>>> Alex
>>>
>>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>> Hi Andrew,
>>>
>>> Do you need the sorting behavior that having an identity reducer gives
>>> you? If not, set the number of reduce tasks to 0 and you'll end up
>>> with a map-only job, which should be significantly faster.
>>>
>>> -Todd
>>>
>>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen
>>> <andrew-lists-had...@ucsfcti.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently set up a 5-node cluster (1 master, 4 slaves) and am
>>>> looking to use it to process high volumes of patient physiologic
>>>> data. As an initial exercise to gain a better understanding, I have
>>>> attempted to run the following problem (which isn't the type of
>>>> problem that Hadoop was really designed for, as is my understanding).
>>>>
>>>> I have a 6GB data file that contains key/value pairs of <sample
>>>> number, sample value>. I'd like to convert the values, based on a
>>>> gain/offset, to their physical units. I've set up a MapReduce job
>>>> using streaming where the mapper does the conversion and the reducer
>>>> is just an identity reducer. Based on other threads on the mailing
>>>> list, my initial results are consistent in that it takes considerably
>>>> more time to process this in Hadoop than on my MacBook Pro (45
>>>> minutes vs. 13 minutes). The input is a single 6GB file, and it looks
>>>> like the file is being split into 101 map tasks. This is consistent
>>>> with the 64MB block size.
>>>>
>>>> So my questions are:
>>>>
>>>> * Would it help to increase the block size to 128MB? Or decrease the
>>>> block size? What are some key factors to think about with this
>>>> question?
>>>> * Are there any other optimizations that I could employ? I have
>>>> looked into LzoCompression, but I'd like to still work without
>>>> compression since the single-threaded job I'm comparing against
>>>> doesn't use any sort of compression. I know I'm comparing apples to
>>>> pears a little here, so please feel free to correct this assumption.
>>>> * Is Hadoop really only good for jobs where the data doesn't fit on
>>>> a single node? At some level, I assume that it can still speed up
>>>> jobs that do fit on one node, if only because you are performing
>>>> tasks in parallel.
>>>>
>>>> Thanks!
>>>>
>>>> --Andrew
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
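For reference, the map-only streaming job suggested above could look something like the sketch below. The input/output paths, the streaming jar location (matching the 0.20.2 tarball layout), and the gain/offset constants 2.5 and -512 are placeholders, not values from the thread:

    #!/bin/sh
    # convert.sh - hypothetical mapper: applies a linear gain/offset to the
    # value column of tab-separated <sample number, sample value> input.
    # 2.5 and -512 stand in for the real calibration constants.
    exec awk '{ print $1 "\t" ($2 * 2.5 - 512) }'

    # Run it map-only: setting mapred.reduce.tasks to 0 drops the identity
    # reducer and its sort/shuffle pass entirely, per Todd's suggestion.
    hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/andrew/samples \
        -output /user/andrew/samples-physical \
        -mapper convert.sh \
        -file convert.sh

With no reduce phase, each map task writes its converted block straight back to HDFS, so the job's cost is essentially one pass of read-transform-write per 64MB block.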