This brings up two interesting issues:

1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason.
2. If Hadoop streaming is "at best a jury-rigged solution", then that should
be made known somewhere on the wiki.  If it's really not supposed to be
used, why is it provided at all?

thanks,
Travis

On Mon, Mar 31, 2008 at 3:23 PM, lin <[EMAIL PROTECTED]> wrote:

> Well, we would like to use Hadoop streaming because our current system is in
> C++ and it is easier to migrate to Hadoop streaming. Also we have very strict
> performance requirements. Java seems to be too slow. I rewrote the first
> program in Java and it runs 4 to 5 times slower than the C++ one.
>
> On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> >
> > Hadoop can't split a gzipped file, so you will only get as many maps as you
> > have files.
> >
> > Why the obsession with Hadoop streaming?  It is at best a jury-rigged
> > solution.
> >
> >
> > On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> >
> > > Does Hadoop automatically decompress the gzipped file? I only have a single
> > > input file. Does it have to be split and then gzipped?
> > >
> > > I gzipped the input file and Hadoop only created one map task. Still, Java is
> > > using more than 90% CPU.
> > >
> > > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > >> provide the input files gzipped. Not a great difference (e.g. 50% slower
> > >> when not gzipped, plus it took more than twice as long to upload the
> > >> data to HDFS-on-S3 in the first place), but still probably relevant.
> > >>
> > >> Andreas
> > >>
> > >> On Monday, 31.03.2008, at 13:30 -0700, lin wrote:
> > >>> I'm running custom map programs written in C++. What the programs do is
> > >>> very simple. For example, in program 2, for each input line
> > >>>         ID node1 node2 ... nodeN
> > >>> the program outputs
> > >>>         node1 ID
> > >>>         node2 ID
> > >>>         ...
> > >>>         nodeN ID
> > >>>
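For reference, a minimal sketch of what such a streaming mapper might look like
in C++; the whitespace-separated input fields and tab-separated output (which
streaming treats as key/value) are assumptions here, not lin's actual code:

    // Reads lines of the form "ID node1 node2 ... nodeN" from stdin and
    // emits one "node<TAB>ID" pair per node on stdout.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::ios::sync_with_stdio(false);  // reduce per-read stdio overhead
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream fields(line);
            std::string id, node;
            if (!(fields >> id)) continue;   // skip blank lines
            while (fields >> node) {
                // text before the first tab becomes the output key
                std::cout << node << '\t' << id << '\n';
            }
        }
        return 0;
    }
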
> > >>> Each node has 4GB to 8GB of memory. The Java memory setting is -Xmx300m.
> > >>>
> > >>> I agree that it depends on the scripts. I tried replicating the
> > >>> computation for each input line 10 times and saw significantly better
> > >>> speedup. But it is still pretty bad that Hadoop streaming has such a big
> > >>> overhead for simple programs.
> > >>>
> > >>> I also tried writing program 1 with the Hadoop Java API. I got almost
> > >>> 1000% speedup on the cluster.
> > >>>
> > >>> Lin
> > >>>
> > >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> > >>> wrote:
> > >>>
> > >>>> Are you running a custom map script or a standard Linux command like
> > >>>> WC? If custom, what does your script do?
> > >>>>
> > >>>> How much RAM do you have? What are your Java memory settings?
> > >>>>
> > >>>> I used the following setup:
> > >>>>
> > >>>> 2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box with
> > >>>> a 4-task max.
> > >>>>
> > >>>> I got the following results:
> > >>>>
> > >>>> WC 30-40% speedup
> > >>>> Sort 40% speedup
> > >>>> Grep 5X slowdown (turns out this was due to what you described above...
> > >>>> Grep is just very highly optimized for the command line)
> > >>>> Custom Perl script (essentially a for loop which matches each row of a
> > >>>> dataset to a set of 100 categories) 60% speedup
> > >>>>
> > >>>> So I do think that it depends on your script... and some other settings
> > >>>> of yours.
> > >>>>
> > >>>> Theo
> > >>>>
> > >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am looking into using Hadoop streaming to parallelize some simple
> > >>>>> programs. So far the performance has been pretty disappointing.
> > >>>>>
> > >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
> > >>>>> capacity of each node is 2. The Hadoop version is 0.15.
> > >>>>>
> > >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> > >>>>> standalone mode (on a single CPU core). Program 2 runs for 5 minutes on
> > >>>>> the Hadoop cluster and 4.5 minutes in standalone mode. Both programs run
> > >>>>> as map-only jobs.
> > >>>>>
> > >>>>> I understand that there is some overhead in starting up tasks and in
> > >>>>> reading from and writing to the distributed file system. But that does
> > >>>>> not seem to explain all the overhead. Most map tasks are data-local. I
> > >>>>> modified program 1 to output nothing and saw the same magnitude of
> > >>>>> overhead.
> > >>>>>
> > >>>>> The output of top shows that the majority of the CPU time is consumed
> > >>>>> by Hadoop Java processes (e.g.
> > >>>>> org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
> > >>>>> option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > >>>>>
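For reference, this kind of per-task JVM option can also be passed on the
streaming command line; a sketch assuming the 0.15-era -jobconf syntax, with a
hypothetical jar path, input/output directories, and mapper name:

    hadoop jar contrib/hadoop-streaming.jar \
        -jobconf mapred.child.java.opts=-agentlib:hprof=cpu=samples \
        -input input_dir -output output_dir \
        -mapper ./program1 -file program1
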
> > >>>>> The profile results show that most of the CPU time is spent in the
> > >>>>> following methods:
> > >>>>>
> > >>>>>   rank   self  accum   count trace method
> > >>>>>      1 23.76% 23.76%    1246 300472 java.lang.UNIXProcess.waitForProcessExit
> > >>>>>      2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > >>>>>      3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > >>>>>      4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> > >>>>>
> > >>>>> And their stack traces show that these methods are for interacting with
> > >>>>> the map program.
> > >>>>>
> > >>>>>
> > >>>>> TRACE 300472:
> > >>>>>        java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>>>>
> > >>>>> TRACE 300474:
> > >>>>>        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> > >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > >>>>>
> > >>>>> TRACE 300479:
> > >>>>>        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > >>>>>
> > >>>>> TRACE 300478:
> > >>>>>        java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >>>>>        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > >>>>>
> > >>>>>
> > >>>>> I don't understand why Hadoop streaming needs so much CPU time to read
> > >>>>> from and write to the map program. Note that it takes 23.67% of the time
> > >>>>> to read from the standard error of the map program, while the program
> > >>>>> does not output any errors at all!
> > >>>>>
> > >>>>> Does anyone know any way to get rid of this seemingly unnecessary
> > >>>>> overhead in Hadoop streaming?
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Lin
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Theodore Van Rooy
> > >>>> http://greentheo.scroggles.com
> > >>>>
> > >>
> >
> >
>



-- 
Travis Brady
www.mochiads.com
