Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a huge difference (e.g. about 50% slower when not gzipped, and it also took more than twice as long to upload the uncompressed data to HDFS-on-S3 in the first place), but still probably relevant.
Andreas

On Monday, 2008-03-31, at 13:30 -0700, lin wrote:
> I'm running custom map programs written in C++. What the programs do is very
> simple. For example, in program 2, for each input line
>
>     ID node1 node2 ... nodeN
>
> the program outputs
>
>     node1 ID
>     node2 ID
>     ...
>     nodeN ID
>
> Each node has 4GB to 8GB of memory. The Java memory setting is -Xmx300m.
>
> I agree that it depends on the scripts. I tried replicating the computation
> for each input line 10 times and saw significantly better speedup. But it
> is still pretty bad that Hadoop streaming has such big overhead for simple
> programs.
>
> I also tried writing program 1 with the Hadoop Java API. I got almost 1000%
> speedup on the cluster.
>
> Lin
>
> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
>
> > Are you running a custom map script or a standard Linux command like wc?
> > If custom, what does your script do?
> >
> > How much RAM do you have? What are your Java memory settings?
> >
> > I used the following setup:
> >
> > 2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box with
> > a 4-task max.
> >
> > I got the following results:
> >
> > wc: 30-40% speedup
> > sort: 40% speedup
> > grep: 5x slowdown (turns out this was due to what you described above...
> > grep is just very highly optimized for the command line)
> > Custom Perl script (essentially a for loop which matches each row of a
> > dataset to a set of 100 categories): 60% speedup
> >
> > So I do think that it depends on your script... and some other settings
> > of yours.
> >
> > Theo
> >
> > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I am looking into using Hadoop streaming to parallelize some simple
> > > programs. So far the performance has been pretty disappointing.
> > >
> > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > capacity of each node is 2. The Hadoop version is 0.15.
> > >
> > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > > standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> > > Hadoop cluster and 4.5 minutes standalone. Both programs run as
> > > map-only jobs.
> > >
> > > I understand that there is some overhead in starting up tasks and in
> > > reading from and writing to the distributed file system. But that does
> > > not seem to explain all the overhead. Most map tasks are data-local. I
> > > modified program 1 to output nothing and saw the same magnitude of
> > > overhead.
> > >
> > > The output of top shows that the majority of the CPU time is consumed
> > > by Hadoop Java processes (e.g.
> > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
> > > option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > >
> > > The profile results show that most of the CPU time is spent in the
> > > following methods:
> > >
> > > rank  self    accum   count  trace   method
> > > 1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > 2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > 3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > 4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
> > >
> > > And their stack traces show that these methods are used for
> > > interacting with the map program.
> > >
> > > TRACE 300472:
> > >     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > >     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >
> > > TRACE 300474:
> > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > >     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > >
> > > TRACE 300479:
> > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > >
> > > TRACE 300478:
> > >     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > >     java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >     java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > >
> > > I don't understand why Hadoop streaming needs so much CPU time to read
> > > from and write to the map program. Note that 23.67% of the time goes to
> > > reading from the standard error of the map program, while the program
> > > does not output any errors at all!
> > >
> > > Does anyone know any way to get rid of this seemingly unnecessary
> > > overhead in Hadoop streaming?
> > >
> > > Thanks,
> > >
> > > Lin
> >
> > --
> > Theodore Van Rooy
> > http://greentheo.scroggles.com
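[Editor's note: Lin's profiling setup — adding -agentlib:hprof=cpu=samples to mapred.child.java.opts — would look roughly like this in an 0.15-era hadoop-site.xml. A sketch, assuming the -Xmx300m heap setting from the thread is kept alongside the agent flag:]

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -agentlib:hprof=cpu=samples</value>
</property>
```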
