Well, we would like to use Hadoop streaming because our current system is in
C++, and it is easier to migrate it to Hadoop streaming. We also have very
strict performance requirements, and Java seems to be too slow: I rewrote the
first program in Java and it runs 4 to 5 times slower than the C++ version.
On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Hadoop can't split a gzipped file, so you will only get as many maps as
> you have files.
>
> Why the obsession with Hadoop streaming? It is at best a jury-rigged
> solution.
>
> On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
>
>> Does Hadoop automatically decompress the gzipped file? I only have a
>> single input file. Does it have to be split and then gzipped?
>>
>> I gzipped the input file and Hadoop only created one map task. Java is
>> still using more than 90% of the CPU.
>>
>> On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>>
>>> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>>> provide the input files gzipped. Not a great difference (e.g. 50%
>>> slower when not gzipped, plus it took more than twice as long to
>>> upload the data to HDFS-on-S3 in the first place), but still probably
>>> relevant.
>>>
>>> Andreas
>>>
>>> On Monday, 31.03.2008, 13:30 -0700, lin wrote:
>>>
>>>> I'm running custom map programs written in C++. What the programs do
>>>> is very simple. For example, in program 2, for each input line
>>>>
>>>>     ID node1 node2 ... nodeN
>>>>
>>>> the program outputs
>>>>
>>>>     node1 ID
>>>>     node2 ID
>>>>     ...
>>>>     nodeN ID
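>>>>
>>>> (For concreteness, a minimal sketch of a mapper with this shape. The
>>>> assumptions here are that fields are whitespace-separated and that a
>>>> tab separates the output key and value, which is the streaming
>>>> default; the real program differs in details:
>>>>
>>>>     #include <iostream>
>>>>     #include <sstream>
>>>>     #include <string>
>>>>
>>>>     int main() {
>>>>         std::ios::sync_with_stdio(false); // reduce iostream overhead
>>>>         std::string line;
>>>>         while (std::getline(std::cin, line)) {
>>>>             std::istringstream ss(line);
>>>>             std::string id, node;
>>>>             if (!(ss >> id)) continue;    // skip empty lines
>>>>             while (ss >> node)            // emit one line per node
>>>>                 std::cout << node << '\t' << id << '\n';
>>>>         }
>>>>         return 0;
>>>>     }
>>>>
>>>> The general shape is all that matters: read lines from stdin, write
>>>> key/value lines to stdout.)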
>>>>
>>>> Each node has 4 GB to 8 GB of memory. The Java memory setting is
>>>> -Xmx300m.
>>>>
>>>> I agree that it depends on the scripts. I tried replicating the
>>>> computation for each input line 10 times and saw significantly better
>>>> speedup. But it is still pretty bad that Hadoop streaming has such a
>>>> big overhead for simple programs.
>>>>
>>>> I also tried writing program 1 with the Hadoop Java API. I got almost
>>>> a 1000% speedup on the cluster.
>>>>
>>>> Lin
>>>>
>>>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Are you running a custom map script or a standard Linux command like
>>>>> wc? If custom, what does your script do?
>>>>>
>>>>> How much RAM do you have? What are your Java memory settings?
>>>>>
>>>>> I used the following setup: 2 dual-core CPUs, 16 GB of RAM, and a
>>>>> 1000 MB Java heap size on an empty box with a 4-task max.
>>>>>
>>>>> I got the following results:
>>>>>
>>>>>     wc:    30-40% speedup
>>>>>     sort:  40% speedup
>>>>>     grep:  5x slowdown (turns out this was due to what you described
>>>>>            above... grep is just very highly optimized for the
>>>>>            command line)
>>>>>     custom Perl script (essentially a for loop which matches each
>>>>>            row of a dataset against a set of 100 categories):
>>>>>            60% speedup
>>>>>
>>>>> So I do think that it depends on your script... and on some other
>>>>> settings of yours.
>>>>>
>>>>> Theo
>>>>>
>>>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am looking into using Hadoop streaming to parallelize some simple
>>>>>> programs. So far the performance has been pretty disappointing.
>>>>>>
>>>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
>>>>>> capacity of each node is 2. The Hadoop version is 0.15.
>>>>>>
>>>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
>>>>>> standalone (on a single CPU core). Program 2 runs for 5 minutes on
>>>>>> the Hadoop cluster and 4.5 minutes standalone. Both programs run as
>>>>>> map-only jobs.
>>>>>>
>>>>>> I understand that there is some overhead in starting up tasks and in
>>>>>> reading from and writing to the distributed file system. But that
>>>>>> does not seem to explain all the overhead. Most map tasks are
>>>>>> data-local. I modified program 1 to output nothing and saw the same
>>>>>> magnitude of overhead.
>>>>>>
>>>>>> The output of top shows that the majority of the CPU time is
>>>>>> consumed by Hadoop Java processes (e.g.
>>>>>> org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
>>>>>> option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
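>>>>>>
>>>>>> (For reference, a sketch of what that looks like in hadoop-site.xml,
>>>>>> keeping the existing -Xmx300m heap setting alongside the profiling
>>>>>> flag:
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.child.java.opts</name>
>>>>>>       <value>-Xmx300m -agentlib:hprof=cpu=samples</value>
>>>>>>     </property>
>>>>>>
>>>>>> hprof writes its report when each child JVM exits.)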
>>>>>>
>>>>>> The profiling results show that most of the CPU time is spent in the
>>>>>> following methods:
>>>>>>
>>>>>>     rank  self    accum   count  trace   method
>>>>>>     1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
>>>>>>     2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
>>>>>>     3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
>>>>>>     4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
>>>>>>
>>>>>> Their stack traces show that these methods are used for interacting
>>>>>> with the map program:
>>>>>>
>>>>>> TRACE 300472:
>>>>>>     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>>>>>>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>>>
>>>>>> TRACE 300474:
>>>>>>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>>     java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>>     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>>     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>>>>>>
>>>>>> TRACE 300479:
>>>>>>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>>     java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>>     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>>>>>>
>>>>>> TRACE 300478:
>>>>>>     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>>>>>>     java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>>>>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>>>>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>>>>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>>>>     java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>>>>     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>>>>     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>>>     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>>>
>>>>>> I don't understand why Hadoop streaming needs so much CPU time just
>>>>>> to read from and write to the map program. Note that 23.67% of the
>>>>>> time goes to reading from the standard error of the map program,
>>>>>> even though the program does not write any errors at all!
>>>>>>
>>>>>> Does anyone know any way to get rid of this seemingly unnecessary
>>>>>> overhead in Hadoop streaming?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Lin
>>>>>
>>>>> --
>>>>> Theodore Van Rooy
>>>>> http://greentheo.scroggles.com