Because many, many people do not enjoy that verbose language, you know.
(I just replaced an old 754-line task with a ported one that takes 89
lines.)

So, as crazy as it might sound to some here, Hadoop streaming is the primary
interface for probably a sizeable part of the "user population". (Users
being developers writing workloads for Hadoop.)

Andreas

On Monday, 31.03.2008 at 15:15 -0700, Ted Dunning wrote:
> 
> Hadoop can't split a gzipped file so you will only get as many maps as you
> have files.
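Since a single gzip file yields a single map, one common workaround (a sketch, not something proposed in this thread; the file naming is made up) is to shard the input into several independently gzipped files before uploading, so each file gets its own map task:

```python
import gzip

def shard_and_gzip(lines, num_shards, prefix="part"):
    # Write each line round-robin into num_shards separately gzipped
    # files. Each output file is a complete gzip stream, so Hadoop can
    # schedule one map task per file even though a single gzip file is
    # unsplittable. The "part-%05d.gz" naming is just an illustration.
    names = ["%s-%05d.gz" % (prefix, i) for i in range(num_shards)]
    writers = [gzip.open(name, "wt") for name in names]
    for i, line in enumerate(lines):
        writers[i % num_shards].write(line + "\n")
    for w in writers:
        w.close()
    return names
```

The trade-off Andreas mentions below (gzipped input uploads faster to HDFS-on-S3 and can run faster overall) still applies per shard.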
> 
> Why the obsession with Hadoop streaming?  It is at best a jury-rigged
> solution.
> 
> 
> On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> 
> > Does Hadoop automatically decompress the gzipped file? I only have a single
> > input file. Does it have to be split and then gzipped?
> > 
> > I gzipped the input file and Hadoop only created one map task. Java is
> > still using more than 90% of the CPU.
> > 
> > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
> > wrote:
> > 
> >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> >> provide the input files gzipped. Not a huge difference (e.g. 50% slower
> >> when not gzipped, plus it took more than twice as long to upload the
> >> data to HDFS-on-S3 in the first place), but still probably relevant.
> >> 
> >> Andreas
> >> 
> >> On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> >>> I'm running custom map programs written in C++. What the programs do is
> >>> very simple. For example, in program 2, for each input line
> >>>         ID node1 node2 ... nodeN
> >>> the program outputs
> >>>         node1 ID
> >>>         node2 ID
> >>>         ...
> >>>         nodeN ID
> >>> 
> >>> Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >>> 
> >>> I agree that it depends on the scripts. I tried replicating the
> >>> computation for each input line 10 times and saw significantly better
> >>> speedup. But it is still pretty bad that Hadoop streaming has such big
> >>> overhead for simple programs.
> >>> 
> >>> I also tried writing program 1 with the Hadoop Java API. I got almost a
> >>> 1000% speedup on the cluster.
> >>> 
> >>> Lin
> >>> 
> >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> >>> wrote:
> >>> 
> >>>> Are you running a custom map script or a standard Linux command like
> >>>> wc? If custom, what does your script do?
> >>>> 
> >>>> How much RAM do you have? What are your Java memory settings?
> >>>> 
> >>>> I used the following setup:
> >>>> 
> >>>> 2 dual-core CPUs, 16 GB RAM, and a 1000 MB Java heap size, on an
> >>>> otherwise empty box with a 4-task max.
> >>>> 
> >>>> I got the following results:
> >>>> 
> >>>> wc: 30-40% speedup
> >>>> Sort: 40% speedup
> >>>> Grep: 5x slowdown (turns out this was due to what you described
> >>>> above... grep is just very highly optimized for the command line)
> >>>> Custom Perl script (essentially a for loop which matches each row of
> >>>> a dataset to a set of 100 categories): 60% speedup
> >>>> 
> >>>> So I do think that it depends on your script... and some other
> >>>> settings of yours.
> >>>> 
> >>>> Theo
> >>>> 
> >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> >>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> I am looking into using Hadoop streaming to parallelize some simple
> >>>>> programs. So far the performance has been pretty disappointing.
> >>>>> 
> >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
> >>>>> capacity of each node is 2. The Hadoop version is 0.15.
> >>>>> 
> >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> >>>>> standalone mode (on a single CPU core). Program 2 runs for 5 minutes
> >>>>> on the Hadoop cluster and 4.5 minutes in standalone mode. Both
> >>>>> programs run as map-only jobs.
> >>>>> 
> >>>>> I understand that there is some overhead in starting up tasks and in
> >>>>> reading from and writing to the distributed file system. But that
> >>>>> does not seem to explain all the overhead. Most map tasks are
> >>>>> data-local. I modified program 1 to output nothing and saw the same
> >>>>> magnitude of overhead.
> >>>>> 
> >>>>> The output of top shows that the majority of the CPU time is consumed
> >>>>> by Hadoop Java processes (e.g.
> >>>>> org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
> >>>>> option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> >>>>> 
> >>>>> The profiling results show that most of the CPU time is spent in the
> >>>>> following methods:
> >>>>> 
> >>>>>   rank   self  accum   count trace method
> >>>>> 
> >>>>>   1 23.76% 23.76%    1246 300472 java.lang.UNIXProcess.waitForProcessExit
> >>>>> 
> >>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> >>>>> 
> >>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> >>>>> 
> >>>>>   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> >>>>> 
> >>>>> And their stack traces show that these methods are for interacting
> >>>>> with the map program.
> >>>>> 
> >>>>> 
> >>>>> TRACE 300472:
> >>>>> 
> >>>>>        java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>>>> 
> >>>>> TRACE 300474:
> >>>>> 
> >>>>>        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> >>>>> 
> >>>>> TRACE 300479:
> >>>>> 
> >>>>>        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> >>>>> 
> >>>>> TRACE 300478:
> >>>>> 
> >>>>>        java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
> >>>>>        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> >>>>> 
> >>>>> 
> >>>>> I don't understand why Hadoop streaming needs so much CPU time to
> >>>>> read from and write to the map program. Note that 23.67% of the time
> >>>>> goes to reading from the map program's standard error even though the
> >>>>> program does not write any errors at all!
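For context on where that time goes: in streaming, every record crosses a process boundary twice, since the Java child feeds the map program's stdin and drains its stdout (and stderr) through pipes. A rough Python sketch of that pattern (an illustration of the mechanism, not Hadoop's actual code), assuming a trivial pass-through child:

```python
import subprocess

def pipe_through(lines, argv):
    # Feed each record to the child's stdin and collect its stdout,
    # the way streaming's Java task process feeds the map program.
    # Every record is copied across the process boundary twice, which
    # is where FileInputStream.readBytes / FileOutputStream.writeBytes
    # time accumulates for trivial mappers.
    proc = subprocess.run(
        argv,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()
```

Here `argv` would be the map command. Note also that a sampling profiler attributes samples to threads blocked in read(); the MRErrorThread trace above sits in readLine() on the program's stderr even when nothing is ever written there, which may account for much of its ~23% share.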
> >>>>> 
> >>>>> Does anyone know any way to get rid of this seemingly unnecessary
> >>>>> overhead in Hadoop streaming?
> >>>>> 
> >>>>> Thanks,
> >>>>> 
> >>>>> Lin
> >>>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> --
> >>>> Theodore Van Rooy
> >>>> http://greentheo.scroggles.com
> >>>> 
> >> 
