I'm running custom map programs written in C++. What the programs do is very
simple. For example, for each input line

    ID node1 node2 ... nodeN

program 2 outputs

    node1 ID node2 ID ... nodeN ID
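To be concrete, program 2 is essentially the following stdin-to-stdout loop
(a simplified sketch, not the exact code):

    // Reads lines of the form "ID node1 node2 ... nodeN" from stdin
    // and writes "node1 ID node2 ID ... nodeN ID" to stdout.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream tokens(line);
            std::string id, node;
            tokens >> id;                // first token is the line's ID
            bool first = true;
            while (tokens >> node) {     // pair every node with the ID
                std::cout << (first ? "" : " ") << node << ' ' << id;
                first = false;
            }
            std::cout << '\n';
        }
        return 0;
    }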
Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m. I
agree that it depends on the scripts. I tried replicating the computation for
each input line 10 times and saw a significantly better speedup. But it is
still pretty bad that Hadoop streaming has such a big overhead for simple
programs.

I also tried writing program 1 with the Hadoop Java API. I got almost a 1000%
speedup on the cluster.

Lin

On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> Are you running a custom map script or a standard Linux command like wc? If
> custom, what does your script do?
>
> How much RAM do you have? What are your Java memory settings?
>
> I used the following setup:
>
> 2 dual-core CPUs, 16 GB RAM, and a 1000 MB Java heap size on an empty box
> with a 4-task max.
>
> I got the following results:
>
> wc: 30-40% speedup
> sort: 40% speedup
> grep: 5x slowdown (turns out this was due to what you described above...
> grep is just very highly optimized for the command line)
> custom Perl script (essentially a for loop which matches each row of a
> dataset against a set of 100 categories): 60% speedup
>
> So I do think that it depends on your script... and on some other settings
> of yours.
>
> Theo
>
> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am looking into using Hadoop streaming to parallelize some simple
> > programs. So far the performance has been pretty disappointing.
> >
> > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > capacity of each node is 2. The Hadoop version is 0.15.
> >
> > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> > Hadoop cluster and 4.5 minutes standalone. Both programs run as map-only
> > jobs.
> >
> > I understand that there is some overhead in starting up tasks and in
> > reading from and writing to the distributed file system. But that does
> > not seem to explain all of the overhead. Most map tasks are data-local.
> > I modified program 1 to output nothing and saw the same magnitude of
> > overhead.
> >
> > The output of top shows that the majority of the CPU time is consumed by
> > Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child).
> > So I added a profiling option (-agentlib:hprof=cpu=samples) to
> > mapred.child.java.opts.
> >
> > The profile results show that most of the CPU time is spent in the
> > following methods:
> >
> > rank  self    accum   count  trace   method
> >    1  23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> >    2  23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> >    3  23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> >    4  16.15%  87.32%   847   300478  java.io.FileOutputStream.writeBytes
> >
> > Their stack traces show that these methods are used for interacting with
> > the map program:
> >
> > TRACE 300472:
> >     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> >     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >
> > TRACE 300474:
> >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> >     java.io.FileInputStream.read(FileInputStream.java:199)
> >     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> >     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> >
> > TRACE 300479:
> >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> >     java.io.FileInputStream.read(FileInputStream.java:199)
> >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> >     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> >
> > TRACE 300478:
> >     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> >     java.io.FileOutputStream.write(FileOutputStream.java:260)
> >     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> >     java.io.DataOutputStream.flush(DataOutputStream.java:106)
> >     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> >     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> >
> > I don't understand why Hadoop streaming needs so much CPU time to read
> > from and write to the map program. Note that 23.67% of the time goes to
> > reading from the standard error of the map program (trace 300479), while
> > the program does not write anything to standard error at all!
> >
> > Does anyone know any way to get rid of this seemingly unnecessary
> > overhead in Hadoop streaming?
> >
> > Thanks,
> >
> > Lin
>
> --
> Theodore Van Rooy
> http://greentheo.scroggles.com
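P.S. In case it helps anyone reproduce this, the streaming jobs were launched
along the following lines (the jar location and HDFS paths here are
placeholders, not the exact ones I used):

    # jar and HDFS paths below are placeholders
    bin/hadoop jar contrib/hadoop-streaming.jar \
        -input /user/lin/input \
        -output /user/lin/out2 \
        -file program2 \
        -mapper program2 \
        -jobconf mapred.reduce.tasks=0 \
        -jobconf "mapred.child.java.opts=-Xmx300m -agentlib:hprof=cpu=samples"

mapred.reduce.tasks=0 makes it a map-only job, and the second -jobconf is
where I added the hprof option mentioned above.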