Are you running a custom map script, or a standard Linux command like wc? If custom, what does your script do?
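For reference, a standard command such as wc runs as a streaming mapper along these lines; the jar location and HDFS paths here are illustrative and depend on the install:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/wc-out \
        -mapper /usr/bin/wc \
        -jobconf mapred.reduce.tasks=0

(Setting mapred.reduce.tasks=0 makes it a map-only job, matching the jobs discussed below.)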
How much RAM do you have, and what are your Java memory settings? I used the following setup: two dual-core CPUs, 16 GB of RAM, and a 1000 MB Java heap size on an otherwise idle box with a 4-task max. I got the following results:

- wc: 30-40% speedup
- sort: 40% speedup
- grep: 5x slowdown (it turns out this was due to what you described above... grep is just very highly optimized for the command line)
- a custom Perl script, essentially a for loop that matches each row of a dataset against a set of 100 categories: 60% speedup

So I do think that it depends on your script... and on some other settings of yours.

Theo

On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am looking into using Hadoop streaming to parallelize some simple
> programs. So far the performance has been pretty disappointing.
>
> The cluster contains 5 nodes. Each node has two CPU cores and a task
> capacity of 2. The Hadoop version is 0.15.
>
> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> Hadoop cluster and 4.5 minutes standalone. Both run as map-only jobs.
>
> I understand that there is some overhead in starting up tasks and in
> reading from and writing to the distributed file system, but that does
> not seem to explain all of the difference. Most map tasks are data-local,
> and when I modified program 1 to output nothing I saw the same magnitude
> of overhead (see the null-mapper sketch at the end of this post).
>
> The output of top shows that the majority of the CPU time is consumed by
> Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child),
> so I added a profiling option (-agentlib:hprof=cpu=samples) to
> mapred.child.java.opts.
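For reference, rather than editing the site configuration, the same option can be passed per job with streaming's -jobconf flag. Roughly, with placeholder program name and paths:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/prof-out \
        -mapper program1 -file program1 \
        -jobconf mapred.reduce.tasks=0 \
        -jobconf mapred.child.java.opts=-agentlib:hprof=cpu=samples

Note that this replaces the child JVM options wholesale, so any heap setting (e.g. -Xmx) you rely on must be repeated in the value. When the child JVM exits, hprof writes its report (typically java.hprof.txt) to the task's working directory.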
> The profile results show that most of the CPU time is spent in the
> following methods:
>
> rank  self     accum    count  trace   method
>    1  23.76%   23.76%    1246  300472  java.lang.UNIXProcess.waitForProcessExit
>    2  23.74%   47.50%    1245  300474  java.io.FileInputStream.readBytes
>    3  23.67%   71.17%    1241  300479  java.io.FileInputStream.readBytes
>    4  16.15%   87.32%     847  300478  java.io.FileOutputStream.writeBytes
>
> Their stack traces show that these methods are all interacting with the
> map program:
>
> TRACE 300472:
>     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>
> TRACE 300474:
>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>
> TRACE 300479:
>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>
> TRACE 300478:
>     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>     java.io.FileOutputStream.write(FileOutputStream.java:260)
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>     java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>
> I don't understand why Hadoop streaming needs so much CPU time to read
> from and write to the map program. Note that 23.67% of the time (trace
> 300479) is spent reading from the map program's standard error, even
> though the program writes no errors at all!
>
> Does anyone know a way to get rid of this seemingly unnecessary overhead
> in Hadoop streaming?
>
> Thanks,
>
> Lin

--
Theodore Van Rooy
http://greentheo.scroggles.com
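For the "output nothing" experiment mentioned above, a minimal null mapper could look like the following sketch; the script name, jar location, and paths are placeholders, not taken from the thread:

    #!/bin/sh
    # null-mapper.sh: consume all of stdin and emit nothing,
    # leaving only task startup and pipe overhead to measure
    cat > /dev/null

Submitted as a map-only streaming job:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/null-out \
        -mapper null-mapper.sh -file null-mapper.sh \
        -jobconf mapred.reduce.tasks=0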