Hadoop can't split a gzipped file, so you will only get as many maps as you
have files.
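
If you need more map tasks from compressed input, one workaround is to split
the file yourself and gzip each part before uploading, since each gzipped file
then becomes its own map. A rough sketch of that pre-splitting step (the input
path, part count, and output naming below are made up for illustration):

    // Split one large text file into N gzip-compressed parts so that Hadoop
    // can schedule one map task per part instead of a single map for one
    // big .gz file. Paths and the part count are illustrative only.
    import java.io.*;
    import java.util.zip.GZIPOutputStream;

    public class SplitAndGzip {
        public static void main(String[] args) throws IOException {
            String input = args.length > 0 ? args[0] : "input.txt"; // hypothetical default
            int parts = args.length > 1 ? Integer.parseInt(args[1]) : 10;

            // First pass: count lines so each part gets roughly the same share.
            long total = 0;
            BufferedReader in = new BufferedReader(new FileReader(input));
            while (in.readLine() != null) total++;
            in.close();
            long perPart = (total + parts - 1) / parts;

            // Second pass: write the lines out to gzipped part files.
            in = new BufferedReader(new FileReader(input));
            for (int p = 0; p < parts; p++) {
                Writer out = new OutputStreamWriter(new GZIPOutputStream(
                        new FileOutputStream(input + ".part-" + p + ".gz")));
                for (long i = 0; i < perPart; i++) {
                    String line = in.readLine();
                    if (line == null) break;
                    out.write(line);
                    out.write('\n');
                }
                out.close();
            }
            in.close();
        }
    }

Put the resulting part files into the job's input directory and you get one
map per file, which is as much parallelism as gzipped input will give you.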

Why the obsession with Hadoop streaming? It is at best a jury-rigged
solution.


On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:

> Does Hadoop automatically decompress the gzipped file? I only have a single
> input file. Does it have to be split first and then gzipped?
> 
> I gzipped the input file and Hadoop only created one map task. Java is still
> using more than 90% of the CPU.
> 
> On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
> wrote:
> 
>> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide
>> the input files gzipped. It is not a huge difference (e.g. about 50% slower
>> when not gzipped, plus it took more than twice as long to upload the
>> uncompressed data to HDFS-on-S3 in the first place), but still probably
>> relevant.
>> 
>> Andreas
>> 
>> On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
>>> I'm running custom map programs written in C++. What the programs do is
>>> very simple. For example, in program 2, for each input line
>>>
>>>         ID node1 node2 ... nodeN
>>>
>>> the program outputs
>>>
>>>         node1 ID
>>>         node2 ID
>>>         ...
>>>         nodeN ID
>>> 
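For reference, the per-line transformation described above can be sketched as
a stand-alone filter reading stdin and writing stdout; the original mapper is
a C++ program, so this Java version only restates the logic for illustration:

    // Streaming-style filter: for an input line "ID node1 node2 ... nodeN",
    // emit one "nodeK<TAB>ID" line per node. Mirrors the described C++
    // mapper; not the original code.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class InvertEdges {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                String[] tok = line.trim().split("\\s+");
                if (tok.length < 2) continue;  // skip empty or malformed lines
                StringBuilder out = new StringBuilder();
                for (int i = 1; i < tok.length; i++) {
                    out.append(tok[i]).append('\t').append(tok[0]).append('\n');
                }
                System.out.print(out);
            }
        }
    }
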
>>> Each node has 4GB to 8GB of memory. The Java memory setting is -Xmx300m.
>>> 
>>> I agree that it depends on the scripts. I tried replicating the computation
>>> for each input line 10 times and saw significantly better speedup, but it
>>> is still pretty bad that Hadoop streaming has such a big overhead for
>>> simple programs.
>>> 
>>> I also tried writing program 1 with the Hadoop Java API. I got almost a
>>> 1000% speedup on the cluster.
>>> 
>>> Lin
>>> 
>>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
>>> wrote:
>>> 
>>>> Are you running a custom map script or a standard Linux command like wc?
>>>> If custom, what does your script do?
>>>> 
>>>> How much RAM do you have? What are your Java memory settings?
>>>> 
>>>> I used the following setup:
>>>> 
>>>> 2 dual-core CPUs, 16 GB of RAM, a 1000 MB Java heap size, on an otherwise
>>>> empty box with a 4-task max.
>>>> 
>>>> I got the following results:
>>>> 
>>>> wc: 30-40% speedup
>>>> Sort: 40% speedup
>>>> Grep: 5x slowdown (turns out this was due to what you described above...
>>>> grep is just very highly optimized on the command line)
>>>> Custom Perl script (essentially a for loop that matches each row of a
>>>> dataset against a set of 100 categories): 60% speedup
>>>> 
>>>> So I do think that it depends on your script... and on some of your other
>>>> settings.
>>>> 
>>>> Theo
>>>> 
>>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am looking into using Hadoop streaming to parallelize some simple
>>>>> programs. So far the performance has been pretty disappointing.
>>>>> 
>>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
>>>>> capacity of each node is 2. The Hadoop version is 0.15.
>>>>> 
>>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
>>>>> standalone mode (on a single CPU core). Program 2 runs for 5 minutes on
>>>>> the Hadoop cluster and 4.5 minutes in standalone mode. Both programs run
>>>>> as map-only jobs.
>>>>> 
>>>>> I understand that there is some overhead in starting up tasks and in
>>>>> reading from and writing to the distributed file system, but that does
>>>>> not seem to explain all of the overhead. Most map tasks are data-local.
>>>>> I modified program 1 to output nothing and saw the same magnitude of
>>>>> overhead.
>>>>> 
>>>>> The output of top shows that the majority of the CPU time is consumed by
>>>>> the Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child),
>>>>> so I added a profiling option (-agentlib:hprof=cpu=samples) to
>>>>> mapred.child.java.opts.
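
A minimal sketch of how that same profiling option can be set when a job is
configured through the Java API (the class name is hypothetical; for a
streaming job the equivalent is passing the property with -jobconf on the
command line):

    // Attach the hprof CPU sampler to the child JVMs of a job. The class
    // name is hypothetical; only the property name and value come from the
    // discussion above.
    import org.apache.hadoop.mapred.JobConf;

    public class ProfiledJob {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ProfiledJob.class);
            // Keep the existing heap limit and add the hprof sampling agent.
            conf.set("mapred.child.java.opts",
                     "-Xmx300m -agentlib:hprof=cpu=samples");
            // ... set input/output paths and the mapper, then submit with
            // JobClient.runJob(conf).
        }
    }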
>>>>> 
>>>>> The profile results show that most of the CPU time is spent in the
>>>>> following methods:
>>>>>
>>>>>   rank   self   accum   count  trace   method
>>>>>      1  23.76%  23.76%   1246  300472  java.lang.UNIXProcess.waitForProcessExit
>>>>>      2  23.74%  47.50%   1245  300474  java.io.FileInputStream.readBytes
>>>>>      3  23.67%  71.17%   1241  300479  java.io.FileInputStream.readBytes
>>>>>      4  16.15%  87.32%    847  300478  java.io.FileOutputStream.writeBytes
>>>>> 
>>>>> Their stack traces show that these methods are used for interacting with
>>>>> the map program.
>>>>> 
>>>>> 
>>>>> TRACE 300472:
>>>>>
>>>>>         java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>>>>>         java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>         java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>>
>>>>> TRACE 300474:
>>>>>
>>>>>         java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>         java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>         java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>         java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>         java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>         org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>         org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>         org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>>>>>
>>>>> TRACE 300479:
>>>>>
>>>>>         java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>>>>>         java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>         java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>         java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>         org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>>>         org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>>>>>         org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>>>>>
>>>>> TRACE 300478:
>>>>>
>>>>>         java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>>>>>         java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>>>         java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>>>         java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>>>         java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>>>         java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>>>         org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>>>         org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>         org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>>         org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>> 
>>>>> 
>>>>> I don't understand why Hadoop streaming needs so much CPU time to read
>>>>> from and write to the map program. Note that 23.67% of the time is spent
>>>>> reading from the standard error of the map program, even though the
>>>>> program does not output anything on stderr at all!
>>>>> 
>>>>> Does anyone know of a way to get rid of this seemingly unnecessary
>>>>> overhead in Hadoop streaming?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Lin
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Theodore Van Rooy
>>>> http://greentheo.scroggles.com
>>>> 
>> 
