Are you running a custom map script, or a standard Linux command like wc? If custom, what does your script do?
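For reference, a standard command such as wc runs as a streaming mapper along these lines; the jar location and HDFS paths here are illustrative and depend on the install:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/wc-out \
        -mapper /usr/bin/wc \
        -jobconf mapred.reduce.tasks=0

(Setting mapred.reduce.tasks=0 makes it a map-only job, matching the jobs discussed below.)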
How much RAM do you have, and what are your Java memory settings? I used the following setup: two dual-core CPUs, 16 GB of RAM, and a 1000 MB Java heap size on an otherwise idle box with a 4-task max. I got the following results:

- wc: 30-40% speedup
- sort: 40% speedup
- grep: 5x slowdown (it turns out this was due to what you described above... grep is just very highly optimized for the command line)
- a custom Perl script, essentially a for loop that matches each row of a dataset against a set of 100 categories: 60% speedup

So I do think that it depends on your script... and on some other settings of yours.

Theo

On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am looking into using Hadoop streaming to parallelize some simple
> programs. So far the performance has been pretty disappointing.
>
> The cluster contains 5 nodes. Each node has two CPU cores and a task
> capacity of 2. The Hadoop version is 0.15.
>
> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> Hadoop cluster and 4.5 minutes standalone. Both run as map-only jobs.
>
> I understand that there is some overhead in starting up tasks and in
> reading from and writing to the distributed file system, but that does
> not seem to explain all of the difference. Most map tasks are data-local,
> and when I modified program 1 to output nothing I saw the same magnitude
> of overhead (see the null-mapper sketch at the end of this post).
>
> The output of top shows that the majority of the CPU time is consumed by
> Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child),
> so I added a profiling option (-agentlib:hprof=cpu=samples) to
> mapred.child.java.opts.
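For reference, rather than editing the site configuration, the same option can be passed per job with streaming's -jobconf flag. Roughly, with placeholder program name and paths:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/prof-out \
        -mapper program1 -file program1 \
        -jobconf mapred.reduce.tasks=0 \
        -jobconf mapred.child.java.opts=-agentlib:hprof=cpu=samples

Note that this replaces the child JVM options wholesale, so any heap setting (e.g. -Xmx) you rely on must be repeated in the value. When the child JVM exits, hprof writes its report (typically java.hprof.txt) to the task's working directory.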
> The profile results show that most of the CPU time is spent in the
> following methods:
>
> rank  self     accum    count  trace   method
>    1  23.76%   23.76%    1246  300472  java.lang.UNIXProcess.waitForProcessExit
>    2  23.74%   47.50%    1245  300474  java.io.FileInputStream.readBytes
>    3  23.67%   71.17%    1241  300479  java.io.FileInputStream.readBytes
>    4  16.15%   87.32%     847  300478  java.io.FileOutputStream.writeBytes
>
> Their stack traces show that these methods are all interacting with the
> map program:
>
> TRACE 300472:
>     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>
> TRACE 300474:
>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>
> TRACE 300479:
>     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.FilterInputStream.read(FilterInputStream.java:66)
>     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>
> TRACE 300478:
>     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>     java.io.FileOutputStream.write(FileOutputStream.java:260)
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>     java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>
> I don't understand why Hadoop streaming needs so much CPU time to read
> from and write to the map program. Note that 23.67% of the time (trace
> 300479) is spent reading from the map program's standard error, even
> though the program writes no errors at all!
>
> Does anyone know a way to get rid of this seemingly unnecessary overhead
> in Hadoop streaming?
>
> Thanks,
>
> Lin

--
Theodore Van Rooy
http://greentheo.scroggles.com
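For the "output nothing" experiment mentioned above, a minimal null mapper could look like the following sketch; the script name, jar location, and paths are placeholders, not taken from the thread:

    #!/bin/sh
    # null-mapper.sh: consume all of stdin and emit nothing,
    # leaving only task startup and pipe overhead to measure
    cat > /dev/null

Submitted as a map-only streaming job:

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -input /user/you/input -output /user/you/null-out \
        -mapper null-mapper.sh -file null-mapper.sh \
        -jobconf mapred.reduce.tasks=0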