Hadoop can't split a gzipped file, so you will only get as many maps as you have files.
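Since a .gz file is not splittable, one work-around implied by the question below ("does it have to be split and then gzipped?") is to split the raw input into several pieces and gzip each piece, so that every part file becomes its own map task. A minimal sketch in plain Java, assuming a line-oriented input; the file names and part count here are illustrative, not anything from the thread:

    import java.io.*;
    import java.util.zip.GZIPOutputStream;

    /** Split a large text file into N gzipped parts so Hadoop can run N map tasks. */
    public class SplitAndGzip {
        public static void main(String[] args) throws IOException {
            File input = new File(args[0]);        // e.g. input.txt
            int parts = Integer.parseInt(args[1]); // e.g. 10
            long sizePerPart = input.length() / parts + 1;

            BufferedReader in = new BufferedReader(new FileReader(input));
            int part = 0;
            long written = 0;
            Writer out = nextPart(input, part++);
            String line;
            while ((line = in.readLine()) != null) {
                if (written >= sizePerPart) {      // roll to a new part on a line boundary
                    out.close();
                    out = nextPart(input, part++);
                    written = 0;
                }
                out.write(line);
                out.write('\n');
                written += line.length() + 1;      // rough count, only used to balance parts
            }
            out.close();
            in.close();
        }

        private static Writer nextPart(File input, int part) throws IOException {
            String name = input.getName() + ".part" + part + ".gz";
            return new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(new File(input.getParent(), name))));
        }
    }

Uploading the resulting part files to HDFS gives the job one map per gzipped file, which is the only parallelism available for gzip input.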
Why the obsession with Hadoop streaming? It is at best a jury-rigged solution.

On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:

> Does Hadoop automatically decompress the gzipped file? I only have a single
> input file. Does it have to be split and then gzipped?
>
> I gzipped the input file and Hadoop only created one map task. Still, Java is
> using more than 90% CPU.
>
> On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>
>> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>> provide the input files gzipped. Not a great difference (e.g. 50% slower
>> when not gzipped, plus it took more than twice as long to upload the
>> data to HDFS-on-S3 in the first place), but still probably relevant.
>>
>> Andreas
>>
>> On Monday, 2008-03-31 at 13:30 -0700, lin wrote:
>>
>>> I'm running custom map programs written in C++. What the programs do is
>>> very simple. For example, in program 2, for each input line
>>>
>>>     ID node1 node2 ... nodeN
>>>
>>> the program outputs
>>>
>>>     node1 ID
>>>     node2 ID
>>>     ...
>>>     nodeN ID
>>>
>>> Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m.
>>>
>>> I agree that it depends on the scripts. I tried replicating the computation
>>> for each input line 10 times and saw significantly better speedup. But it
>>> is still pretty bad that Hadoop streaming has such big overhead for simple
>>> programs.
>>>
>>> I also tried writing program 1 with the Hadoop Java API. I got almost 1000%
>>> speedup on the cluster.
>>>
>>> Lin
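For reference, the map step lin describes for program 2 is just a line inversion. A minimal sketch of an equivalent streaming mapper, written here in Java rather than lin's C++ (the class name, whitespace tokenization, and tab separator are assumptions, not lin's actual code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    /** Streaming mapper: for each input line "ID node1 node2 ... nodeN",
     *  emit one "nodeK<TAB>ID" line per node. Reads stdin and writes stdout,
     *  which is all Hadoop streaming expects of the child process. */
    public class InvertMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                if (tokens.length < 2) {
                    continue;                    // no nodes on this line
                }
                String id = tokens[0];
                for (int i = 1; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "\t" + id);
                }
            }
        }
    }

Because each input line does so little work, almost all of the cost of running such a mapper under streaming is the byte shuffling between the TaskTracker child and this process, which matches the profile further down the thread.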
>>>
>>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Are you running a custom map script or a standard Linux command like wc?
>>>> If custom, what does your script do?
>>>>
>>>> How much RAM do you have? What are your Java memory settings?
>>>>
>>>> I used the following setup:
>>>>
>>>> 2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box with a
>>>> 4-task max.
>>>>
>>>> I got the following results:
>>>>
>>>> wc: 30-40% speedup
>>>> Sort: 40% speedup
>>>> Grep: 5x slowdown (turns out this was due to what you described above...
>>>> grep is just very highly optimized for the command line)
>>>> Custom Perl script (essentially a for loop that matches each row of a
>>>> dataset to a set of 100 categories): 60% speedup
>>>>
>>>> So I do think that it depends on your script... and on some other settings
>>>> of yours.
>>>>
>>>> Theo
>>>>
>>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am looking into using Hadoop streaming to parallelize some simple
>>>>> programs. So far the performance has been pretty disappointing.
>>>>>
>>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
>>>>> capacity of each node is 2. The Hadoop version is 0.15.
>>>>>
>>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
>>>>> standalone (on a single CPU core). Program 2 runs for 5 minutes on the
>>>>> Hadoop cluster and 4.5 minutes standalone. Both programs run as map-only
>>>>> jobs.
>>>>>
>>>>> I understand that there is some overhead in starting up tasks and in
>>>>> reading from and writing to the distributed file system. But that does
>>>>> not seem to explain all the overhead. Most map tasks are data-local. I
>>>>> modified program 1 to output nothing and saw the same magnitude of
>>>>> overhead.
>>>>>
>>>>> The output of top shows that the majority of the CPU time is consumed by
>>>>> Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child).
>>>>> So I added a profiling option (-agentlib:hprof=cpu=samples) to
>>>>> mapred.child.java.opts.
>>>>>
>>>>> The profile results show that most of the CPU time is spent in the
>>>>> following methods:
>>>>>
>>>>> rank   self     accum    count  trace   method
>>>>>    1   23.76%   23.76%    1246  300472  java.lang.UNIXProcess.waitForProcessExit
>>>>>    2   23.74%   47.50%    1245  300474  java.io.FileInputStream.readBytes
>>>>>    3   23.67%   71.17%    1241  300479  java.io.FileInputStream.readBytes
>>>>>    4   16.15%   87.32%     847  300478  java.io.FileOutputStream.writeBytes
>>>>>
>>>>> And their stack traces show that these methods are for interacting with
>>>>> the map program.
>>>>>
>>>>> TRACE 300472:
>>>>>     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>>>>>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>>
>>>>> TRACE 300474:
>>>>>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>     java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>>>>>
>>>>> TRACE 300479:
>>>>>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>     java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>>>>>
>>>>> TRACE 300478:
>>>>>     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>>>>>     java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>>>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>>>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>>>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>>>     java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>>>     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>>>     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>>     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
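The MROutputThread and MRErrorThread frames in these traces are the streaming helper threads that drain the child process's stdout and stderr. A sampling profiler like hprof generally counts a thread blocked in a native read() as if it were running, which would explain why the stderr reader accounts for a large share of samples even when the map program writes no errors. A minimal sketch of that pump pattern follows; it is not the actual PipeMapRed code, and the class and method names are made up for illustration:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    /** Illustration of the streaming pattern: launch the map program and
     *  drain its stdout and stderr with dedicated threads. Each drain thread
     *  spends almost all of its time blocked inside FileInputStream.readBytes,
     *  which is where hprof attributes the samples. */
    public class PipePumpSketch {
        public static void main(String[] args) throws Exception {
            Process child = new ProcessBuilder(args).start();   // e.g. the C++ mapper

            Thread outPump = drain(child.getInputStream(), "stdout");
            Thread errPump = drain(child.getErrorStream(), "stderr");
            outPump.start();
            errPump.start();

            int rc = child.waitFor();   // shows up as UNIXProcess.waitForProcessExit
            outPump.join();
            errPump.join();
            System.err.println("child exited with " + rc);
        }

        private static Thread drain(final InputStream stream, final String label) {
            return new Thread(new Runnable() {
                public void run() {
                    try {
                        BufferedReader r = new BufferedReader(new InputStreamReader(stream));
                        String line;
                        while ((line = r.readLine()) != null) {  // blocks until the child writes
                            System.out.println(label + ": " + line);
                        }
                    } catch (IOException e) {
                        // the child closed the pipe; nothing more to read
                    }
                }
            });
        }
    }

In other words, the high percentages probably reflect time spent waiting on the pipes rather than CPU actually burned, which is a known blind spot of hprof=cpu=samples.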
>>>>>
>>>>> I don't understand why Hadoop streaming needs so much CPU time to read
>>>>> from and write to the map program. Note that it takes 23.67% of the time
>>>>> to read from the standard error of the map program while the program does
>>>>> not output any errors at all!
>>>>>
>>>>> Does anyone know any way to get rid of this seemingly unnecessary
>>>>> overhead in Hadoop streaming?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Lin
>>>>
>>>> --
>>>> Theodore Van Rooy
>>>> http://greentheo.scroggles.com
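Lin mentions earlier in the thread that rewriting program 1 against the Hadoop Java API gave close to a 10x improvement, which avoids the child process and pipe traffic entirely. As a rough illustration only, here is what a map-only mapper for the inversion logic of program 2 might look like with the old org.apache.hadoop.mapred API (the class name and key/value choices are assumptions, not Lin's code, and the exact interface signatures vary a little between early Hadoop versions):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    /** Native-API version of the inversion mapper: no child process and no
     *  stdin/stdout piping, so none of the PipeMapRed overhead shown above. */
    public class InvertJavaMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private final Text node = new Text();
        private final Text id = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String[] tokens = line.toString().trim().split("\\s+");
            if (tokens.length < 2) {
                return;                       // no nodes on this line
            }
            id.set(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                node.set(tokens[i]);
                output.collect(node, id);     // emit nodeK -> ID
            }
        }
    }

To mirror the map-only streaming jobs in the thread, the driver would set the number of reduce tasks to zero (e.g. conf.setNumReduceTasks(0) on the JobConf).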