LineRecordReader.readLine() was deprecated by HADOOP-2285 
(http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow, 
but streaming still uses the method. HADOOP-2826 
(http://issues.apache.org/jira/browse/HADOOP-2826) will remove that usage 
from streaming.
This change should improve streaming performance. When I ran a simple cat 
through streaming, it took 33 seconds with HADOOP-2826 applied, versus 52 
seconds on trunk.
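For context, the cost difference comes from pulling one byte per stream call versus filling a buffer in bulk. A minimal sketch of the two patterns (not the Hadoop source; the helper name is made up for illustration):

```java
import java.io.*;

public class ReadLineSketch {
    // Byte-at-a-time line reading, similar in spirit to the deprecated
    // LineRecordReader.readLine(): one stream read() call per byte, which
    // is expensive on an unbuffered or synchronized stream.
    static String readLineByteAtATime(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {   // one call per byte
            if (b == '\n') break;
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) return null;  // end of stream
        return buf.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first line\nsecond line\n".getBytes("UTF-8");

        // Slow pattern: single-byte reads.
        System.out.println(readLineByteAtATime(new ByteArrayInputStream(data)));

        // Fast pattern: buffered bulk reads; one fill serves many lines.
        BufferedReader fast = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(data), "UTF-8"));
        System.out.println(fast.readLine());
    }
}
```

Both print "first line"; the difference is only in how many calls reach the underlying stream.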

Thanks
Amareshwari.

lin wrote:
Hi,

I am looking into using Hadoop streaming to parallelize some simple
programs. So far the performance has been pretty disappointing.

The cluster contains 5 nodes. Each node has two CPU cores. The task capacity
of each node is 2. The Hadoop version is 0.15.

Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
standalone (on a single CPU core). Program 2 runs for 5 minutes on the Hadoop
cluster and 4.5 minutes standalone. Both programs run as map-only jobs.

I understand that there is some overhead in starting up tasks and in
reading from and writing to the distributed file system, but that does not
seem to explain all of it. Most map tasks are data-local, and I modified
program 1 to output nothing and still saw the same magnitude of overhead.

The output of top shows that the majority of the CPU time is consumed by
Hadoop java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child). So
I added a profiling option (-agentlib:hprof=cpu=samples) to
mapred.child.java.opts.

The profile results show that most of the CPU time is spent in the following
methods:

   rank   self  accum   count trace method

   1 23.76% 23.76%    1246 300472 java.lang.UNIXProcess.waitForProcessExit

   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes

   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes

   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes

And their stack traces show that these methods are for interacting with the
map program.


TRACE 300472:

        java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)

        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)

        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

TRACE 300474:

        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)

        java.io.FileInputStream.read(FileInputStream.java:199)

        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)

        java.io.BufferedInputStream.read(BufferedInputStream.java:317)

        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        java.io.FilterInputStream.read(FilterInputStream.java:66)

        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)

        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)

        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)

TRACE 300479:

        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)

        java.io.FileInputStream.read(FileInputStream.java:199)

        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        java.io.FilterInputStream.read(FilterInputStream.java:66)

        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)

        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)

        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)

TRACE 300478:

        java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)

        java.io.FileOutputStream.write(FileOutputStream.java:260)

        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)

        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)

        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)

        java.io.DataOutputStream.flush(DataOutputStream.java:106)

        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)

        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)

        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)


I don't understand why Hadoop streaming needs so much CPU time to read from
and write to the map program. Note that 23.67% of the time goes to reading
from the standard error of the map program, even though the program does
not output any errors at all!
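As the TRACE 300479 stack suggests, streaming runs the map program as a child process and dedicates a thread (PipeMapRed$MRErrorThread) to drain its standard error; that thread sits blocked in read() even when the child writes nothing, so it shows up in the samples. A self-contained sketch of that pattern, with "cat" standing in for the mapper (an assumption for illustration, not the Hadoop source):

```java
import java.io.*;

public class StderrDrainSketch {
    public static void main(String[] args) throws Exception {
        // Launch a child process, as streaming does for the map program.
        Process p = new ProcessBuilder("cat").start();

        // Dedicated drainer thread, analogous to PipeMapRed$MRErrorThread:
        // it blocks in read() on the child's stderr until the child exits,
        // even if nothing is ever written there.
        Thread err = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                     new InputStreamReader(p.getErrorStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.err.println("mapper stderr: " + line);
                }
            } catch (IOException ignored) {}
        });
        err.start();

        // Feed one record to the child and read its output back,
        // as the map-side pipe does for each input line.
        try (OutputStream out = p.getOutputStream()) {
            out.write("hello\n".getBytes("UTF-8"));
        }
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(p.getInputStream()))) {
            System.out.println(in.readLine());  // the echoed record
        }
        err.join();
        p.waitFor();
    }
}
```

The stderr thread here never prints anything, yet it spends its whole life in a blocking read, which is the same shape as the trace above.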

Does anyone know any way to get rid of this seemingly unnecessary overhead
in Hadoop streaming?

Thanks,

Lin

