Yes. I would have to look up the exact numbers in my Sent folder (I've been reporting them to my boss all along), but it has been a sizeable portion of the runtime (on an EC2+HDFS-on-S3 cluster), especially when one includes the time to load the data into HDFS.
Ah, I went ahead and looked it up: 93 minutes for memberfind on the unzipped input files (+ 45 minutes to upload the 15GB of unzipped data), and 70 minutes for memberfind on the gzipped files. (Critically important: you need to name the files with a .gz suffix, or Hadoop won't recognize and unzip them, which trashes the runs :( ) So if I add up the difference, it's almost 100% of the runtime of the gzipped run. (The input set is 15GB unzipped, 24.8M lines, 1.6GB gzipped.)

One interesting detail: I also took the time to measure the runtime without Hadoop streaming for my input set, and Hadoop added less than 5% compared to running it directly. (I guess if I had run two processes to amortize the time the VM's "disc" spent loading, I might have managed to speed up the throughput of the "native" run slightly.) The relevant thing here is that the performance hit of Hadoop seemed low enough that I had no case for dropping Hadoop completely. ;)

Andreas

On Monday, 31.03.2008, at 18:42 -0400, Colin Freas wrote:
> Really?
>
> I would expect the opposite: for compressed files to process slower.
>
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
>
> -Colin
>
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
> wrote:
>
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not a great difference (e.g. 50% slower
> > when not gzipped, plus it took more than twice as long to upload the
> > data to HDFS-on-S3 in the first place), but still probably relevant.
> >
> > Andreas
> >
> > On Monday, 31.03.2008, at 13:30 -0700, lin wrote:
> > > I'm running custom map programs written in C++. What the programs do
> > > is very simple. For example, in program 2, for each input line
> > >   ID node1 node2 ... nodeN
> > > the program outputs
> > >   node1 ID
> > >   node2 ID
> > >   ...
> > >   nodeN ID
> > >
> > > Each node has 4GB to 8GB of memory. The Java memory setting is -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > > computation for each input line 10 times and saw significantly better
> > > speedup. But it is still pretty bad that Hadoop streaming has such big
> > > overhead for simple programs.
> > >
> > > I also tried writing program 1 with the Hadoop Java API. I got almost
> > > 1000% speedup on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Are you running a custom map script or a standard Linux command like
> > > > WC? If custom, what does your script do?
> > > >
> > > > How much RAM do you have? What are your Java memory settings?
> > > >
> > > > I used the following setup:
> > > >
> > > > 2 dual cores, 16 GB RAM, 1000MB Java heap size on an empty box with a
> > > > 4-task max.
> > > >
> > > > I got the following results:
> > > >
> > > > WC: 30-40% speedup
> > > > Sort: 40% speedup
> > > > Grep: 5X slowdown (turns out this was due to what you described
> > > > above... grep is just very highly optimized for the command line)
> > > > Custom Perl script (essentially a for loop which matches each row of
> > > > a dataset to a set of 100 categories): 60% speedup
> > > >
> > > > So I do think that it depends on your script... and some other
> > > > settings of yours.
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > > programs. So far the performance has been pretty disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > > capacity of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > > > > standalone (on a single CPU core). Program 2 runs for 5 minutes on
> > > > > the Hadoop cluster and 4.5 minutes standalone. Both programs run as
> > > > > map-only jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks and in
> > > > > reading from and writing to the distributed file system. But that
> > > > > does not seem to explain all the overhead. Most map tasks are
> > > > > data-local. I modified program 1 to output nothing and saw the same
> > > > > magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > > > > consumed by Hadoop Java processes (e.g.
> > > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
> > > > > option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > > > >
> > > > > The profile results show that most of the CPU time is spent in the
> > > > > following methods:
> > > > >
> > > > > rank  self    accum   count  trace   method
> > > > > 1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > > > 2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > > > 3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > > > 4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
> > > > >
> > > > > And their stack traces show that these methods are for interacting
> > > > > with the map program.
> > > > > TRACE 300472:
> > > > >     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > > > >     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > > >     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > > >
> > > > > TRACE 300474:
> > > > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > > >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > > > >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > > > >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > > > >     org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
> > > > >
> > > > > TRACE 300479:
> > > > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > > > >     java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >     org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
> > > > >     org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
> > > > >     org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
> > > > >
> > > > > TRACE 300478:
> > > > >     java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
> > > > >     java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > > >     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > > > >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > > > >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > > > >     java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > > >     org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > > > >     org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >     org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > > >     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > > > >
> > > > > I don't understand why Hadoop streaming needs so much CPU time to
> > > > > read from and write to the map program. Note that it takes 23.67% of
> > > > > the time to read from the standard error of the map program while
> > > > > the program does not output any error at all!
> > > > >
> > > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > > > overhead in Hadoop streaming?
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > >
> > > > --
> > > > Theodore Van Rooy
> > > > http://greentheo.scroggles.com
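
For reference, here is a minimal sketch of the kind of map program lin describes above as "program 2". It is a hypothetical reconstruction, assuming whitespace-separated input fields and the usual Hadoop streaming convention of writing tab-separated key/value pairs to stdout; the file name and exact input format are assumptions, not taken from the thread. A streaming mapper is simply an executable that reads lines from stdin and writes lines to stdout, so something like this would be handed to a map-only streaming job via the -mapper option.

// invert_mapper.cpp (hypothetical name): for each input line
// "ID node1 node2 ... nodeN", emit one "node<TAB>ID" line per node.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Don't sync with C stdio; keeps per-line overhead on the C++ side low.
    std::ios::sync_with_stdio(false);

    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream fields(line);

        std::string id;
        if (!(fields >> id)) {
            continue;  // skip blank lines
        }

        std::string node;
        while (fields >> node) {
            // Hadoop streaming treats the text before the first tab as the key.
            std::cout << node << '\t' << id << '\n';
        }
    }
    return 0;
}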