Yes. I would have to look up the exact numbers in my Sent folder (I've
been reporting to my boss all along), but it has been a sizeable
portion of the runtime (on an EC2 + HDFS-on-S3 cluster), especially
when one includes the time to load the data into HDFS.

Ah, I went ahead and looked it up:

        93 minutes for memberfind on the unzipped input files (+ 45
        minutes to upload the 15 GB of unzipped data).

        70 minutes for memberfind on the gzipped files (critically
        important: you need to name the files with a .gz suffix, or
        Hadoop won't recognize and unzip them, which trashes the
        runs :( )
        
So if I add up the differences (runtime plus upload time), they come to
almost 100% of the runtime of the gzipped run. (The input set is 15 GB
unzipped, 24.8M lines, 1.6 GB gzipped.)

One interesting detail: I also took the time to measure the runtime
without Hadoop streaming for my input set, and Hadoop took less than 5%
of the time compared to running it directly. (I guess if I had run two
processes to amortize the periods where the VM's "disc" was busy
loading, I might have managed to speed up the throughput of the
"native" run slightly.) The relevant point here is that the performance
hit from Hadoop seemed low enough that I had no case for dropping
Hadoop completely. ;)

Andreas


On Monday, 31.03.2008 at 18:42 -0400, Colin Freas wrote:
> Really?
> 
> I would expect the opposite: for compressed files to process slower.
> 
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
> 
> -Colin
> 
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
> wrote:
> 
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not a huge difference (e.g. 50% slower
> > when not gzipped, plus it took more than twice as long to upload the
> > data to HDFS-on-S3 in the first place), but still probably relevant.
> >
> > Andreas
> >
> > On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> > > I'm running custom map programs written in C++. What the programs do
> > > is very simple. For example, in program 2, for each input line
> > >
> > >         ID node1 node2 ... nodeN
> > >
> > > the program outputs
> > >
> > >         node1 ID
> > >         node2 ID
> > >         ...
> > >         nodeN ID
> > >
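(For context, a minimal sketch of what a streaming mapper of that shape
might look like in C++; this is illustrative only, not lin's actual
code. Hadoop streaming treats everything up to the first tab of an
output line as the key, so the sketch emits "node<TAB>ID" pairs:)

        // Illustrative sketch (not lin's code): read lines of the form
        // "ID node1 node2 ... nodeN" from stdin and emit one
        // "nodeX<TAB>ID" line per node.
        #include <iostream>
        #include <sstream>
        #include <string>

        int main() {
            std::ios::sync_with_stdio(false);  // cheaper stdin/stdout I/O
            std::string line;
            while (std::getline(std::cin, line)) {
                std::istringstream fields(line);
                std::string id, node;
                if (!(fields >> id))
                    continue;                  // skip blank lines
                while (fields >> node)
                    std::cout << node << '\t' << id << '\n';
            }
            return 0;
        }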
> > > Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > > computation for each input line 10 times and saw significantly better
> > > speedup. But it is still pretty bad that Hadoop streaming has such a
> > > big overhead for simple programs.
> > >
> > > I also tried writing program 1 with the Hadoop Java API. I got almost
> > > a 1000% speedup on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Are you running a custom map script or a standard Linux command
> > > > like wc? If custom, what does your script do?
> > > >
> > > > How much RAM do you have? What are your Java memory settings?
> > > >
> > > > I used the following setup:
> > > >
> > > > 2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box
> > > > with a 4-task max.
> > > >
> > > > I got the following results:
> > > >
> > > > wc: 30-40% speedup
> > > > Sort: 40% speedup
> > > > Grep: 5x slowdown (turns out this was due to what you described
> > > > above... grep is just very highly optimized for the command line)
> > > > Custom Perl script (essentially a for loop which matches each row
> > > > of a dataset to a set of 100 categories): 60% speedup
> > > >
> > > > So I do think that it depends on your script... and some other
> > > > settings of yours.
> > > >
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > > programs. So far the performance has been pretty disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The
> > > > > task capacity of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > > > > standalone (on a single CPU core). Program 2 runs for 5 minutes on
> > > > > the Hadoop cluster and 4.5 minutes standalone. Both programs run as
> > > > > map-only jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks and
> > > > > in reading from and writing to the distributed file system. But
> > > > > that does not seem to explain all the overhead. Most map tasks are
> > > > > data-local. I modified program 1 to output nothing and saw the same
> > > > > magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > > > > consumed by Hadoop Java processes (e.g.
> > > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
> > > > > option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > > > >
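(For reference, a sketch of how such a setting is typically applied in
hadoop-site.xml; the -Xmx300m heap size is the one lin mentions above,
and hprof normally writes its report to java.hprof.txt in the process's
working directory:)

        <!-- Sketch: pass the hprof sampler to each map/reduce child JVM. -->
        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx300m -agentlib:hprof=cpu=samples</value>
        </property>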
> > > > > The profile results show that most of the CPU time is spent in the
> > > > > following methods:
> > > > >
> > > > >   rank   self    accum   count  trace   method
> > > > >   1      23.76%  23.76%   1246  300472  java.lang.UNIXProcess.waitForProcessExit
> > > > >   2      23.74%  47.50%   1245  300474  java.io.FileInputStream.readBytes
> > > > >   3      23.67%  71.17%   1241  300479  java.io.FileInputStream.readBytes
> > > > >   4      16.15%  87.32%    847  300478  java.io.FileOutputStream.writeBytes
> > > > >
> > > > > And their stack traces show that these methods are for interacting
> > > > > with the map program.
> > > > >
> > > > > TRACE 300472:
> > > > >
> > > > >        java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > > > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > > >
> > > > > TRACE 300474:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > > > >
> > > > > TRACE 300479:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > > > >
> > > > > TRACE 300478:
> > > > >
> > > > >        java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > > > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > > >        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > > > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > > > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > > >        org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > > > >
> > > > >
> > > > > I don't understand why Hadoop streaming needs so much CPU time to
> > > > > read from and write to the map program. Note that it takes 23.67%
> > > > > of the time to read from the standard error of the map program,
> > > > > while the program does not output any errors at all!
> > > > >
> > > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > > > overhead in Hadoop streaming?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Theodore Van Rooy
> > > > http://greentheo.scroggles.com
> > > >
> >
