Re: Hadoop streaming performance problem
You're right. Java isn't really that slow. I re-examined the Java code for the standalone program and found I was using an unbuffered output method. After I changed it to a buffered method, the Java running time was comparable to the C++ one. This also means the 1000% speed-up I reported was quite wrong. Hadoop runs Java programs a little more efficiently than Hadoop streaming, but not significantly.

On Mon, Mar 31, 2008 at 3:30 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> This seems a bit surprising. In my experience, well-written Java is
> generally just about as fast as C++, especially for I/O-bound work. The
> exceptions are:
>
> - Java startup is still slow. This shouldn't matter much here because
>   you are using streaming anyway, so you have Java startup + C startup.
>
> - Java memory use is larger than a perfectly written C++ program's and
>   probably always will be. Of course, getting that perfectly written
>   C++ program is pretty difficult.
>
> If the program in question is the one you posted earlier, it is very
> hard to imagine that Java would not saturate the I/O channels. Can you
> confirm that you are literally doing:
>
>     for (node : line.split(" ")) {
>         out.collect(node, key);
>     }
>
> and finding a serious difference in speed?
>
> On 3/31/08 3:23 PM, "lin" <[EMAIL PROTECTED]> wrote:
>
> > Java seems to be too slow. I rewrote the first program in Java and it
> > runs 4 to 5 times slower than the C++ one.
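The buffered-versus-unbuffered pitfall described above is language-agnostic. Since the poster's Java code isn't shown, here is a small illustrative Python sketch of the same effect: writing line by line with no userspace buffer issues one write() system call per line, while a buffered writer batches lines into a few large writes.

```python
import os
import tempfile
import time

def write_lines(n: int, buffering: int) -> float:
    """Write n short lines to a temp file with the given buffer size;
    return the elapsed wall-clock seconds."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    t0 = time.perf_counter()
    # buffering=0 issues one write() system call per line (the slow,
    # unbuffered pattern); a large buffer batches lines into few syscalls.
    with open(path, "wb", buffering=buffering) as f:
        for i in range(n):
            f.write(b"node%d\t42\n" % i)
    elapsed = time.perf_counter() - t0
    os.remove(path)
    return elapsed

if __name__ == "__main__":
    slow = write_lines(200_000, buffering=0)
    fast = write_lines(200_000, buffering=1 << 16)
    print(f"unbuffered: {slow:.4f}s  buffered: {fast:.4f}s")
```

On typical hardware the unbuffered variant is many times slower, which is consistent with the 4-5x gap reported for the unbuffered Java version.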
Re: Hadoop streaming performance problem
LineRecordReader.readLine() is deprecated by HADOOP-2285 (http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow, but streaming still uses the method. HADOOP-2826 (http://issues.apache.org/jira/browse/HADOOP-2826) will remove that usage from streaming, which should improve streaming performance. When I ran a simple cat from streaming, it ran in 33 seconds with HADOOP-2826 applied, whereas with trunk it took 52 seconds.

Thanks,
Amareshwari

lin wrote:

Hi,

I am looking into using Hadoop streaming to parallelize some simple programs. So far the performance has been pretty disappointing.

The cluster contains 5 nodes. Each node has two CPU cores. The task capacity of each node is 2. The Hadoop version is 0.15.

Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes standalone (on a single CPU core). Program 2 runs for 5 minutes on the Hadoop cluster and 4.5 minutes standalone. Both programs run as map-only jobs.

I understand that there is some overhead in starting up tasks and in reading from and writing to the distributed file system, but that does not seem to explain all the overhead. Most map tasks are data-local, and I modified program 1 to output nothing and saw the same magnitude of overhead.

The output of top shows that the majority of the CPU time is consumed by Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child), so I added a profiling option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.

The profile results show that most of the CPU time is spent in the following methods:

rank   self    accum   count  trace   method
1      23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
2      23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
3      23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
4      16.15%  87.32%   847   300478  java.io.FileOutputStream.writeBytes

And their stack traces show that these methods are for interacting with the map program.

TRACE 300472:
    java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

TRACE 300474:
    java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
    java.io.FileInputStream.read(FileInputStream.java:199)
    java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.FilterInputStream.read(FilterInputStream.java:66)
    org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
    org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)

TRACE 300479:
    java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
    java.io.FileInputStream.read(FileInputStream.java:199)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.FilterInputStream.read(FilterInputStream.java:66)
    org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
    org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)

TRACE 300478:
    java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
    java.io.FileOutputStream.write(FileOutputStream.java:260)
    java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
    java.io.DataOutputStream.flush(DataOutputStream.java:106)
    org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
    org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
    org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

I don't understand why Hadoop streaming needs so much CPU time to read from and write to the map program. Note that 23.67% of the time goes to reading from the standard error of the map program, even though the program does not output any errors at all!

Does anyone know any way to get rid of this seemingly unnecessary overhead in Hadoop streaming?

Thanks,
Lin
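The readLine hot spot in the profile above is the classic cost of pulling a stream apart in tiny pieces. A hedged Python sketch (illustrative only, not Hadoop's actual code) of the difference between byte-at-a-time line reading and buffered, chunked reading:

```python
import io
import time

def readline_bytewise(stream) -> bytes:
    """Read one line a single byte at a time -- roughly the access
    pattern the deprecated readLine() was criticized for."""
    buf = bytearray()
    while True:
        b = stream.read(1)  # one read call per byte of input
        if not b or b == b"\n":
            return bytes(buf)
        buf += b

def compare(n: int):
    """Parse n lines both ways; return (lines_a, lines_b, slow_s, fast_s)."""
    data = b"".join(b"key%d\tvalue\n" % i for i in range(n))

    t0 = time.perf_counter()
    s = io.BytesIO(data)
    slow_lines = [readline_bytewise(s) for _ in range(n)]
    slow = time.perf_counter() - t0

    # Buffered: the stream hands back large chunks and line splitting
    # happens in memory.
    t0 = time.perf_counter()
    fast_lines = [ln.rstrip(b"\n") for ln in io.BytesIO(data)]
    fast = time.perf_counter() - t0
    return slow_lines, fast_lines, slow, fast

if __name__ == "__main__":
    a, b, slow, fast = compare(10_000)
    assert a == b  # both strategies recover the same lines
    print(f"byte-at-a-time: {slow:.4f}s  buffered split: {fast:.4f}s")
```

Both produce the same lines; only the number of read calls differs, which is the kind of overhead HADOOP-2826 targets.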
Re: Hadoop streaming performance problem
I beg your pardon, but Python is a fast language. Simple operations are usually considerably more expensive than in lower-level languages, but at least in the hands of somebody with enough experience, that doesn't matter too much. Actually, in many practical cases, because of project deadlines, C++ (and to a lesser extent Java) implementations end up with the more naive algorithms/designs, and get beaten by Python. In some cases this gets even stronger when the Python guys cheat and implement the inner-loop stuff in Pyrex/Cython. (That observation is not limited to Python; it usually applies to all higher-level languages. OTOH, I've beaten C++ in a fairly 1:1 development race, so I tend to write Python.)

Andreas

On Monday, 31.03.2008 at 18:10 -0700, Colin Evans wrote:
> At Metaweb, we did a lot of comparisons between streaming (using Python)
> and native Java, and in general streaming performance was not much
> slower than the native Java -- most of the slowdown was from Python
> being a slow language.
>
> The main problems with streaming apps that we found are that they are
> hard to write and there are many ways that you can make simple mistakes
> in streaming that slow down performance.
>
> We've been experimenting with embedding JavaScript (Rhino) and Jython
> for writing jobs, and have found that performance is good and the apps
> are much easier to write. The tight Java integration means that
> performance bottlenecks get rewritten in Java with little sacrifice to
> development speed. One of these days we'll open source these frameworks.
>
> Parand Darugar wrote:
> > Travis Brady wrote:
> >> This brings up two interesting issues:
> >>
> >> 1. Hadoop streaming is a potentially very powerful tool, especially
> >> for those of us who don't work in Java for whatever reason
> >> 2. If Hadoop streaming is "at best a jury rigged solution" then that
> >> should be made known somewhere on the wiki. If it's really not
> >> supposed to be used, why is it provided at all?
> >>
> >
> > A set of reasonable performance tests and results would be very
> > helpful in helping people decide whether to go with streaming or not.
> > Hopefully we can get some numbers from this thread and publish them?
> > Anyone else compared streaming with native Java?
> >
> > Best,
> >
> > Parand
Re: Hadoop streaming performance problem
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a Perl script which categorizes O(10^9) rows at a time. I also wrote a very simple Python script (something like a 'cat') and saw a similar speedup. These tests were with 1 GB files.

We were testing this here at DoubleClick (though it's kind of pointless now given that we have access to Google's MapReduce cluster :-) ), and we regularly process 25-100 GB datasets... the best part of which is that we don't have to rewrite much of our Perl, R, or bash code.

On Mon, Mar 31, 2008 at 7:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> My experiences with Groovy are similar. Noticeable slowdown, but quite
> bearable (almost always better than 50% of best attainable speed).
>
> The highest virtue is that simple programs become simple again. Word
> count is < 5 lines of code.
>
> On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:
>
> > At Metaweb, we did a lot of comparisons between streaming (using
> > Python) and native Java, and in general streaming performance was not
> > much slower than the native Java -- most of the slowdown was from
> > Python being a slow language.
> >
> > The main problems with streaming apps that we found are that they are
> > hard to write and there are many ways that you can make simple
> > mistakes in streaming that slow down performance.
> >
> > We've been experimenting with embedding JavaScript (Rhino) and Jython
> > for writing jobs, and have found that performance is good and the
> > apps are much easier to write. The tight Java integration means that
> > performance bottlenecks get rewritten in Java with little sacrifice
> > to development speed. One of these days we'll open source these
> > frameworks.
> >
> > Parand Darugar wrote:
> >> Travis Brady wrote:
> >>> This brings up two interesting issues:
> >>>
> >>> 1. Hadoop streaming is a potentially very powerful tool, especially
> >>> for those of us who don't work in Java for whatever reason
> >>> 2. If Hadoop streaming is "at best a jury rigged solution" then
> >>> that should be made known somewhere on the wiki. If it's really not
> >>> supposed to be used, why is it provided at all?
> >>>
> >>
> >> A set of reasonable performance tests and results would be very
> >> helpful in helping people decide whether to go with streaming or
> >> not. Hopefully we can get some numbers from this thread and publish
> >> them? Anyone else compared streaming with native Java?
> >>
> >> Best,
> >>
> >> Parand

--
Theodore Van Rooy
http://greentheo.scroggles.com
Re: Hadoop streaming performance problem
My experiences with Groovy are similar. Noticeable slowdown, but quite bearable (almost always better than 50% of best attainable speed).

The highest virtue is that simple programs become simple again. Word count is < 5 lines of code.

On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> At Metaweb, we did a lot of comparisons between streaming (using Python)
> and native Java, and in general streaming performance was not much
> slower than the native Java -- most of the slowdown was from Python
> being a slow language.
>
> The main problems with streaming apps that we found are that they are
> hard to write and there are many ways that you can make simple mistakes
> in streaming that slow down performance.
>
> We've been experimenting with embedding JavaScript (Rhino) and Jython
> for writing jobs, and have found that performance is good and the apps
> are much easier to write. The tight Java integration means that
> performance bottlenecks get rewritten in Java with little sacrifice to
> development speed. One of these days we'll open source these frameworks.
>
> Parand Darugar wrote:
>> Travis Brady wrote:
>>> This brings up two interesting issues:
>>>
>>> 1. Hadoop streaming is a potentially very powerful tool, especially
>>> for those of us who don't work in Java for whatever reason
>>> 2. If Hadoop streaming is "at best a jury rigged solution" then that
>>> should be made known somewhere on the wiki. If it's really not
>>> supposed to be used, why is it provided at all?
>>>
>>
>> A set of reasonable performance tests and results would be very
>> helpful in helping people decide whether to go with streaming or not.
>> Hopefully we can get some numbers from this thread and publish them?
>> Anyone else compared streaming with native Java?
>>
>> Best,
>>
>> Parand
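Ted's point that word count becomes a few lines holds for Python streaming too. A hedged sketch of a streaming mapper/reducer pair (the `mapper`/`reducer` names and the command wiring are illustrative, not from this thread):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Streaming map side: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reduce side: sum counts per word. Assumes the input is
    sorted by key, which the framework guarantees between map and reduce."""
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical wiring: pass 'map' or 'reduce' as the -mapper/-reducer
    # command's argument in a hadoop-streaming invocation.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    stream = (line.rstrip("\n") for line in sys.stdin)
    for out in (mapper if stage == "map" else reducer)(stream):
        print(out)
```

The whole job is two short functions plus stdin/stdout plumbing, which is exactly the simplicity being praised; the thread's caveat is that this convenience costs pipe and line-parsing overhead on the framework side.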
Re: Hadoop streaming performance problem
At Metaweb, we did a lot of comparisons between streaming (using Python) and native Java, and in general streaming performance was not much slower than the native Java -- most of the slowdown was from Python being a slow language.

The main problems with streaming apps that we found are that they are hard to write and there are many ways that you can make simple mistakes in streaming that slow down performance.

We've been experimenting with embedding JavaScript (Rhino) and Jython for writing jobs, and have found that performance is good and the apps are much easier to write. The tight Java integration means that performance bottlenecks get rewritten in Java with little sacrifice to development speed. One of these days we'll open source these frameworks.

Parand Darugar wrote:
> Travis Brady wrote:
>> This brings up two interesting issues:
>>
>> 1. Hadoop streaming is a potentially very powerful tool, especially
>> for those of us who don't work in Java for whatever reason
>> 2. If Hadoop streaming is "at best a jury rigged solution" then that
>> should be made known somewhere on the wiki. If it's really not
>> supposed to be used, why is it provided at all?
>
> A set of reasonable performance tests and results would be very helpful
> in helping people decide whether to go with streaming or not. Hopefully
> we can get some numbers from this thread and publish them? Anyone else
> compared streaming with native Java?
>
> Best,
>
> Parand
Re: Hadoop streaming performance problem
Yes. I would have to look up the exact numbers in my Sent folder (I've been reporting them all along to my boss), but it has been a sizeable portion of the runtime (on an EC2 + HDFS-on-S3 cluster), especially when one includes the time to load the data into HDFS.

Ah, I went ahead and looked it up:

93 minutes for memberfind on the unzipped input files (+ 45 minutes to upload the 15 GB of unzipped data).
70 minutes for memberfind on the gzipped files. (Critically important: you need to name the files with a .gz suffix or Hadoop won't recognize and unzip them, which trashes the runs :( )

So if I add up the difference, it's almost 100% of the runtime of the gzipped run. (The input set is 15 GB unzipped, 24.8M lines, 1.6 GB gzipped.)

One interesting detail is that I have also taken the time to measure the runtime without Hadoop streaming for my input set, and Hadoop took less than 5% of the time compared to running it directly. (I guess if I had run two processes to amortize the time where the VM's "disc" was loading, I might have managed to speed up the throughput of the "native" run slightly.) The relevant thing here is that the performance hit of Hadoop seemed low enough that I had no case for dropping Hadoop completely. ;)

Andreas

On Monday, 31.03.2008 at 18:42 -0400, Colin Freas wrote:
> Really?
>
> I would expect the opposite: for compressed files to process slower.
>
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
>
> -Colin
>
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not a great difference (e.g. 50%
> > slower when not gzipped, plus it took more than twice as long to
> > upload the data to HDFS-on-S3 in the first place), but still probably
> > relevant.
> >
> > Andreas
> >
> > On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> > > I'm running custom map programs written in C++. What the programs
> > > do is very simple. For example, in program 2, for each input line
> > >     ID node1 node2 ... nodeN
> > > the program outputs
> > >     node1 ID
> > >     node2 ID
> > >     ...
> > >     nodeN ID
> > >
> > > Each node has 4 GB to 8 GB of memory. The Java memory setting is
> > > -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > > computation for each input line 10 times and saw significantly
> > > better speedup. But it is still pretty bad that Hadoop streaming
> > > has such big overhead for simple programs.
> > >
> > > I also tried writing program 1 with the Hadoop Java API. I got
> > > almost 1000% speedup on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> > >
> > > > Are you running a custom map script or a standard Linux command
> > > > like wc? If custom, what does your script do?
> > > >
> > > > How much RAM do you have? What are your Java memory settings?
> > > >
> > > > I used the following setup:
> > > >
> > > > 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box
> > > > with a 4-task max.
> > > >
> > > > I got the following results:
> > > >
> > > > WC: 30-40% speedup
> > > > Sort: 40% speedup
> > > > Grep: 5X slowdown (turns out this was due to what you described
> > > > above... grep is just very highly optimized for the command line)
> > > > Custom Perl script (essentially a for loop which matches each row
> > > > of a dataset to a set of 100 categories): 60% speedup
> > > >
> > > > So I do think that it depends on your script... and some other
> > > > settings of yours.
> > > >
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some
> > > > > simple programs. So far the performance has been pretty
> > > > > disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The
> > > > > task capacity of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2
> > > > > minutes standalone (on a single CPU core). Program 2 runs for
> > > > > 5 minutes on the Hadoop cluster and 4.5 minutes standalone.
> > > > > Both programs run as map-only jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks
> > > > > and in reading from and writing to the distributed file
> > > > > system. But that does not seem to explain all the overhead.
> > > > > Most map tasks are data-local. I modified program 1 to output
> > > > > nothing and saw the same magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > > > > consumed by
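lin's program 2, as described in the quoted message, just inverts each adjacency line. The C++ original isn't shown anywhere in the thread, so the following is an illustrative Python reconstruction of that mapper, not lin's code:

```python
import sys

def invert_line(line: str) -> list[str]:
    """For an input line 'ID node1 node2 ... nodeN', return the output
    records 'node1<TAB>ID', 'node2<TAB>ID', ..., 'nodeN<TAB>ID'."""
    node_id, *nodes = line.split()
    return [f"{node}\t{node_id}" for node in nodes]

if __name__ == "__main__":
    # A streaming mapper reads records on stdin and writes key/value
    # pairs to stdout; buffered writes matter here, per the rest of the
    # thread. (Tab as the key separator is streaming's default; the
    # original description used spaces.)
    out = sys.stdout
    for line in sys.stdin:
        if line.strip():
            out.write("\n".join(invert_line(line)) + "\n")
```

The per-record work is trivial, which is why essentially all of the observed cost ends up in the pipe and line-handling machinery rather than in the mapper itself.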
Re: Hadoop streaming performance problem
Because many, many people do not enjoy that verbose language, you know. (I just replaced an old 754-line task with a ported one that takes 89 lines.) So as crazy as it might sound to some here, Hadoop streaming is the primary interface for probably a sizeable part of the "user population". (Users being developers writing workloads for Hadoop.)

Andreas

On Monday, 31.03.2008 at 15:15 -0700, Ted Dunning wrote:
>
> Hadoop can't split a gzipped file, so you will only get as many maps as
> you have files.
>
> Why the obsession with hadoop streaming? It is at best a jury rigged
> solution.
>
> On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
>
> > Does Hadoop automatically decompress the gzipped file? I only have a
> > single input file. Does it have to be split and then gzipped?
> >
> > I gzipped the input file and Hadoop only created one map task. Still,
> > Java is using more than 90% CPU.
> >
> > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
> >
> > > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > > provide the input files gzipped. Not a great difference (e.g. 50%
> > > slower when not gzipped, plus it took more than twice as long to
> > > upload the data to HDFS-on-S3 in the first place), but still
> > > probably relevant.
> > >
> > > Andreas
> > >
> > > On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> > > > I'm running custom map programs written in C++. What the programs
> > > > do is very simple. For example, in program 2, for each input line
> > > >     ID node1 node2 ... nodeN
> > > > the program outputs
> > > >     node1 ID
> > > >     node2 ID
> > > >     ...
> > > >     nodeN ID
> > > >
> > > > Each node has 4 GB to 8 GB of memory. The Java memory setting is
> > > > -Xmx300m.
> > > >
> > > > I agree that it depends on the scripts. I tried replicating the
> > > > computation for each input line 10 times and saw significantly
> > > > better speedup. But it is still pretty bad that Hadoop streaming
> > > > has such big overhead for simple programs.
> > > >
> > > > I also tried writing program 1 with the Hadoop Java API. I got
> > > > almost 1000% speedup on the cluster.
> > > >
> > > > Lin
> > > >
> > > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Are you running a custom map script or a standard Linux command
> > > > > like wc? If custom, what does your script do?
> > > > >
> > > > > How much RAM do you have? What are your Java memory settings?
> > > > >
> > > > > I used the following setup:
> > > > >
> > > > > 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box
> > > > > with a 4-task max.
> > > > >
> > > > > I got the following results:
> > > > >
> > > > > WC: 30-40% speedup
> > > > > Sort: 40% speedup
> > > > > Grep: 5X slowdown (turns out this was due to what you described
> > > > > above... grep is just very highly optimized for the command
> > > > > line)
> > > > > Custom Perl script (essentially a for loop which matches each
> > > > > row of a dataset to a set of 100 categories): 60% speedup
> > > > >
> > > > > So I do think that it depends on your script... and some other
> > > > > settings of yours.
> > > > >
> > > > > Theo
> > > > >
> > > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am looking into using Hadoop streaming to parallelize some
> > > > > > simple programs. So far the performance has been pretty
> > > > > > disappointing.
> > > > > >
> > > > > > The cluster contains 5 nodes. Each node has two CPU cores.
> > > > > > The task capacity of each node is 2. The Hadoop version is
> > > > > > 0.15.
> > > > > >
> > > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2
> > > > > > minutes standalone (on a single CPU core). Program 2 runs
> > > > > > for 5 minutes on the Hadoop cluster and 4.5 minutes
> > > > > > standalone. Both programs run as map-only jobs.
> > > > > >
> > > > > > I understand that there is some overhead in starting up
> > > > > > tasks and in reading from and writing to the distributed
> > > > > > file system. But that does not seem to explain all the
> > > > > > overhead. Most map tasks are data-local. I modified program
> > > > > > 1 to output nothing and saw the same magnitude of overhead.
> > > > > >
> > > > > > The output of top shows that the majority of the CPU time is
> > > > > > consumed by Hadoop java processes (e.g.
> > > > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a
> > > > > > profile option (-agentlib:hprof=cpu=samples) to
> > > > > > mapred.child.java.opts.
> > > > > >
> > > > > > The profile results show that most of CPU time is spent in
> > > > > > the following methods:
> > > > > >
> > > > > > rank   self    accum   count  trace   method
> > > > > > 1      23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > > > > 2      23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > > > > 3      23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > > > > 4      16.15%  87.32%   847
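Ted's point that Hadoop can't split gzipped input, together with Andreas's note elsewhere in the thread that decompression hinges on the `.gz` suffix, can be illustrated with a small Python sketch of suffix-based codec selection. This is a simplified stand-in for Hadoop's actual codec lookup, not its code:

```python
import gzip
import os
import tempfile

def open_input(path: str):
    """Pick a reader by file suffix, loosely mimicking Hadoop's codec
    factory: only a '.gz' suffix gets decompressed. A gzipped file
    *without* the suffix would be fed to the mapper as raw compressed
    bytes -- which is what trashes the run. And since gzip streams
    can't be split, one gzipped file means one map task."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def demo():
    """Round-trip a record through a properly suffixed gzip file.
    Returns (decompressed_text, file_really_gzipped_on_disk)."""
    data = "ID node1 node2\n"
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "input.txt.gz")
        with gzip.open(path, "wt") as f:
            f.write(data)
        with open_input(path) as f:
            text = f.read()                   # decompressed correctly
        with open(path, "rb") as f:
            magic = f.read(2) == b"\x1f\x8b"  # gzip magic number
    return text, magic

if __name__ == "__main__":
    print(demo())
```

The suffix check is the whole dispatch mechanism here, which is why a mis-named file silently produces garbage instead of an error.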
Re: Hadoop streaming performance problem
> A set of reasonable performance tests and results would be very helpful
> in helping people decide whether to go with streaming or not. Hopefully
> we can get some numbers from this thread and publish them? Anyone else
> compared streaming with native Java?

I think this is a great idea. I think it'd also be nice to have a small cookbook-ish thing showing equivalent programs in each style: perhaps a simple streaming example with Python or something, and its equivalent using Java. I'd have written this already, but I don't know Java.

On Mon, Mar 31, 2008 at 3:42 PM, Colin Freas <[EMAIL PROTECTED]> wrote:

> Really?
>
> I would expect the opposite: for compressed files to process slower.
>
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
>
> -Colin
>
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not a great difference (e.g. 50%
> > slower when not gzipped, plus it took more than twice as long to
> > upload the data to HDFS-on-S3 in the first place), but still probably
> > relevant.
> >
> > Andreas
> >
> > On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> > > I'm running custom map programs written in C++. What the programs
> > > do is very simple. For example, in program 2, for each input line
> > >     ID node1 node2 ... nodeN
> > > the program outputs
> > >     node1 ID
> > >     node2 ID
> > >     ...
> > >     nodeN ID
> > >
> > > Each node has 4 GB to 8 GB of memory. The Java memory setting is
> > > -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > > computation for each input line 10 times and saw significantly
> > > better speedup. But it is still pretty bad that Hadoop streaming
> > > has such big overhead for simple programs.
> > >
> > > I also tried writing program 1 with the Hadoop Java API. I got
> > > almost 1000% speedup on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> > >
> > > > Are you running a custom map script or a standard Linux command
> > > > like wc? If custom, what does your script do?
> > > >
> > > > How much RAM do you have? What are your Java memory settings?
> > > >
> > > > I used the following setup:
> > > >
> > > > 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box
> > > > with a 4-task max.
> > > >
> > > > I got the following results:
> > > >
> > > > WC: 30-40% speedup
> > > > Sort: 40% speedup
> > > > Grep: 5X slowdown (turns out this was due to what you described
> > > > above... grep is just very highly optimized for the command line)
> > > > Custom Perl script (essentially a for loop which matches each row
> > > > of a dataset to a set of 100 categories): 60% speedup
> > > >
> > > > So I do think that it depends on your script... and some other
> > > > settings of yours.
> > > >
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some
> > > > > simple programs. So far the performance has been pretty
> > > > > disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The
> > > > > task capacity of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2
> > > > > minutes standalone (on a single CPU core). Program 2 runs for
> > > > > 5 minutes on the Hadoop cluster and 4.5 minutes standalone.
> > > > > Both programs run as map-only jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks
> > > > > and in reading from and writing to the distributed file
> > > > > system. But that does not seem to explain all the overhead.
> > > > > Most map tasks are data-local. I modified program 1 to output
> > > > > nothing and saw the same magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > > > > consumed by Hadoop java processes (e.g.
> > > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a
> > > > > profile option (-agentlib:hprof=cpu=samples) to
> > > > > mapred.child.java.opts.
> > > > >
> > > > > The profile results show that most of CPU time is spent in the
> > > > > following methods:
> > > > >
> > > > > rank   self    accum   count  trace   method
> > > > > 1      23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > > > 2      23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > > > 3      23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > > > 4      16.15%  87.32%   847   300478  java.io.FileOutpu
Re: Hadoop streaming performance problem
Really?

I would expect the opposite: for compressed files to process slower.

You're saying that is not the case, and that compressed files actually increase the speed of jobs?

-Colin

On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:

> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> provide the input files gzipped. Not a great difference (e.g. 50%
> slower when not gzipped, plus it took more than twice as long to upload
> the data to HDFS-on-S3 in the first place), but still probably
> relevant.
>
> Andreas
>
> On Monday, 31.03.2008 at 13:30 -0700, lin wrote:
> > I'm running custom map programs written in C++. What the programs do
> > is very simple. For example, in program 2, for each input line
> >     ID node1 node2 ... nodeN
> > the program outputs
> >     node1 ID
> >     node2 ID
> >     ...
> >     nodeN ID
> >
> > Each node has 4 GB to 8 GB of memory. The Java memory setting is
> > -Xmx300m.
> >
> > I agree that it depends on the scripts. I tried replicating the
> > computation for each input line 10 times and saw significantly better
> > speedup. But it is still pretty bad that Hadoop streaming has such
> > big overhead for simple programs.
> >
> > I also tried writing program 1 with the Hadoop Java API. I got almost
> > 1000% speedup on the cluster.
> >
> > Lin
> >
> > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> >
> > > Are you running a custom map script or a standard Linux command
> > > like wc? If custom, what does your script do?
> > >
> > > How much RAM do you have? What are your Java memory settings?
> > >
> > > I used the following setup:
> > >
> > > 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box with
> > > a 4-task max.
> > >
> > > I got the following results:
> > >
> > > WC: 30-40% speedup
> > > Sort: 40% speedup
> > > Grep: 5X slowdown (turns out this was due to what you described
> > > above... grep is just very highly optimized for the command line)
> > > Custom Perl script (essentially a for loop which matches each row
> > > of a dataset to a set of 100 categories): 60% speedup
> > >
> > > So I do think that it depends on your script... and some other
> > > settings of yours.
> > >
> > > Theo
> > >
> > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am looking into using Hadoop streaming to parallelize some
> > > > simple programs. So far the performance has been pretty
> > > > disappointing.
> > > >
> > > > The cluster contains 5 nodes. Each node has two CPU cores. The
> > > > task capacity of each node is 2. The Hadoop version is 0.15.
> > > >
> > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2
> > > > minutes standalone (on a single CPU core). Program 2 runs for 5
> > > > minutes on the Hadoop cluster and 4.5 minutes standalone. Both
> > > > programs run as map-only jobs.
> > > >
> > > > I understand that there is some overhead in starting up tasks and
> > > > in reading from and writing to the distributed file system. But
> > > > that does not seem to explain all the overhead. Most map tasks
> > > > are data-local. I modified program 1 to output nothing and saw
> > > > the same magnitude of overhead.
> > > >
> > > > The output of top shows that the majority of the CPU time is
> > > > consumed by Hadoop java processes (e.g.
> > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a profile
> > > > option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > > >
> > > > The profile results show that most of CPU time is spent in the
> > > > following methods:
> > > >
> > > > rank   self    accum   count  trace   method
> > > > 1      23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > > 2      23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > > 3      23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > > 4      16.15%  87.32%   847   300478  java.io.FileOutputStream.writeBytes
> > > >
> > > > And their stack traces show that these methods are for
> > > > interacting with the map program.
> > > >
> > > > TRACE 300472:
> > > >     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > > >     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > >     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > >
> > > > TRACE 300474:
> > > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > > >     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > >     java.io.BufferedInputStream.fill(BufferedInputStream.java
Re: Hadoop streaming performance problem
Travis Brady wrote:
> This brings up two interesting issues:
>
> 1. Hadoop streaming is a potentially very powerful tool, especially for
> those of us who don't work in Java for whatever reason
> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> be made known somewhere on the wiki. If it's really not supposed to be
> used, why is it provided at all?

A set of reasonable performance tests and results would be very helpful in letting people decide whether to go with streaming or not. Hopefully we can get some numbers from this thread and publish them? Has anyone else compared streaming with native Java?

Best,
Parand
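As a small step toward the numbers Parand asks for, here is a micro-benchmark sketch (hypothetical code, not from the thread; class and method names are made up) that times unbuffered versus buffered record writes, the very difference the thread's follow-up identified as the cause of the apparent 4-5x Java slowdown:

```java
import java.io.*;

public class WriteBenchmark {
    // One OS-level write per record: this is the "unbuffered output method"
    // pattern that the thread eventually blamed for the Java slowdown.
    static long timeUnbuffered(File f, int n) throws IOException {
        long t0 = System.nanoTime();
        OutputStream out = new FileOutputStream(f);
        for (int i = 0; i < n; i++) {
            out.write(("node" + i + "\t42\n").getBytes());
        }
        out.close();
        return System.nanoTime() - t0;
    }

    // The same records through a buffer: far fewer system calls.
    static long timeBuffered(File f, int n) throws IOException {
        long t0 = System.nanoTime();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(f));
        for (int i = 0; i < n; i++) {
            out.write(("node" + i + "\t42\n").getBytes());
        }
        out.close();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("bench", ".txt");
        int n = 200000;
        System.out.println("unbuffered: " + timeUnbuffered(f, n) / 1000000 + " ms");
        System.out.println("buffered:   " + timeBuffered(f, n) / 1000000 + " ms");
        f.delete();
    }
}
```

Absolute numbers will vary by machine and filesystem; the point is only the relative gap between the two write paths.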
Re: Hadoop streaming performance problem
We have 40 or so engineers who only use streaming at Facebook, so no matter how jury-rigged the solution might be, it's been immensely valuable for developer productivity. As we found with Thrift, letting developers write code in their language of choice has many benefits, including development speed, reuse of common libraries, and using the right language for the right task.

It would be interesting to bake something like streaming into the Hadoop core rather than forcing tab-delimited data through pipes. I know, I know: it's open source, go ahead and add it. Just sayin...

On Mon, Mar 31, 2008 at 3:31 PM, Travis Brady <[EMAIL PROTECTED]> wrote:
> This brings up two interesting issues:
>
> 1. Hadoop streaming is a potentially very powerful tool, especially for
> those of us who don't work in Java for whatever reason
> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> be made known somewhere on the wiki. If it's really not supposed to be
> used, why is it provided at all?
>
> thanks,
> Travis
Re: Hadoop streaming performance problem
It is provided because of point (2). But that doesn't necessarily make it a good thing to do. The basic idea has real problems. Hive is likely to resolve many of these issues (when it becomes publicly available), but some are inherent in the basic idea of moving data across language barriers too many times.

On 3/31/08 3:31 PM, "Travis Brady" <[EMAIL PROTECTED]> wrote:
> This brings up two interesting issues:
>
> 1. Hadoop streaming is a potentially very powerful tool, especially for
> those of us who don't work in Java for whatever reason
> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> be made known somewhere on the wiki. If it's really not supposed to be
> used, why is it provided at all?
>
> thanks,
> Travis
Re: Hadoop streaming performance problem
This brings up two interesting issues:

1. Hadoop streaming is a potentially very powerful tool, especially for those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that should be made known somewhere on the wiki. If it's really not supposed to be used, why is it provided at all?

thanks,
Travis

On Mon, Mar 31, 2008 at 3:23 PM, lin <[EMAIL PROTECTED]> wrote:
> Well, we would like to use hadoop streaming because our current system is in
> C++ and it is easier to migrate to hadoop streaming. Also we have very
> strict performance requirements. Java seems to be too slow. I rewrote the
> first program in Java and it runs 4 to 5 times slower than the C++ one.
Re: Hadoop streaming performance problem
This seems a bit surprising. In my experience well-written Java is generally just about as fast as C++, especially for I/O bound work. The exceptions are: - java startup is still slow. This shouldn't matter much here because you are using streaming anyway so you have java startup + C startup. - java memory use is larger than a perfectly written C++ program and probably always will be. Of course, getting that perfectly written C++ program is pretty difficult. If the program in question is the one you posted earlier, it is very hard to imagine that java would not saturate the I/O channels. Can you confirm that you are literally doing: for (node : line.split(" ")) { out.collect(node, key); } And finding a serious difference in speed? On 3/31/08 3:23 PM, "lin" <[EMAIL PROTECTED]> wrote: > Java seems to be too slow. I rewrote the > first program in Java and it runs 4 to 5 times slower than the C++ one.
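Ted's snippet is pseudocode (the `for` loop is missing the element type). Filled out as a self-contained sketch with no Hadoop dependencies — here `collect` is just a list standing in for `OutputCollector.collect`, and the class name is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class InvertSketch {
    // Stand-in for OutputCollector.collect: gathers (node, id) pairs.
    static List<String[]> collect = new ArrayList<String[]>();

    // For an input line "ID node1 node2 ... nodeN", emit (node, ID) per node.
    static void map(String line) {
        String[] fields = line.split(" ");
        String id = fields[0];
        for (int i = 1; i < fields.length; i++) {
            collect.add(new String[] { fields[i], id });
        }
    }

    public static void main(String[] args) {
        map("42 a b c");
        for (String[] pair : collect) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

This is the entire per-record work of program 1, which is why Ted expects it to be I/O bound rather than CPU bound in either language.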
Re: Hadoop streaming performance problem
Well, we would like to use hadoop streaming because our current system is in C++ and it is easier to migrate to hadoop streaming. Also we have very strict performance requirements. Java seems to be too slow. I rewrote the first program in Java and it runs 4 to 5 times slower than the C++ one.

On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Hadoop can't split a gzipped file so you will only get as many maps as you
> have files.
>
> Why the obsession with hadoop streaming? It is at best a jury rigged
> solution.
Re: Hadoop streaming performance problem
Hadoop can't split a gzipped file so you will only get as many maps as you have files.

Why the obsession with hadoop streaming? It is at best a jury rigged solution.

On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> Does Hadoop automatically decompress the gzipped file? I only have a single
> input file. Does it have to be split and then gzipped?
>
> I gzipped the input file and Hadoop only created one map task. Still java
> is using more than 90% CPU.
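Since gzip is not splittable here, one workaround consistent with Ted's point is to pre-split the input into several independently gzipped files before uploading, so each file becomes its own map task. A rough sketch (hypothetical class and file naming, not part of Hadoop):

```java
import java.io.*;
import java.util.zip.*;

public class SplitAndGzip {
    // Split a plain-text input into `parts` gzipped files, keeping whole
    // lines, so each .gz part can be consumed by its own map task.
    static void split(File input, int parts) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(input));
        Writer[] outs = new Writer[parts];
        for (int i = 0; i < parts; i++) {
            outs[i] = new OutputStreamWriter(new GZIPOutputStream(
                new FileOutputStream(input.getPath() + ".part" + i + ".gz")));
        }
        String line;
        int n = 0;
        while ((line = in.readLine()) != null) {
            outs[n++ % parts].write(line + "\n");  // round-robin whole lines
        }
        for (Writer w : outs) {
            w.close();  // closing finishes the gzip stream for each part
        }
        in.close();
    }

    public static void main(String[] args) throws IOException {
        split(new File(args[0]), Integer.parseInt(args[1]));
    }
}
```

Uploading the resulting parts instead of one big .gz would let the 5-node, 10-slot cluster in this thread actually run maps in parallel.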
Re: Hadoop streaming performance problem
My previous java.opts was actually "--server -Xmx512m". I increased the heap size to 1024M and the running time was about the same. The resident sizes of the java processes seem to be no greater than 150M.

On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> try extending the java heap size as well... I'd be interested to see what
> kind of an effect that has on time (if any).
>
> On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
>
> > I'm running custom map programs written in C++. What the programs do is
> > very simple. For example, in program 2, for each input line
> >     ID node1 node2 ... nodeN
> > the program outputs
> >     node1 ID
> >     node2 ID
> >     ...
> >     nodeN ID
> >
> > Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >
> > I agree that it depends on the scripts. I tried replicating the
> > computation for each input line by 10 times and saw significantly better
> > speedup. But it is still pretty bad that Hadoop streaming has such big
> > overhead for simple programs.
> >
> > I also tried writing program 1 with the Hadoop Java API. I got almost
> > 1000% speed up on the cluster.
> >
> > Lin
> >
> > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> >
> > > are you running a custom map script or a standard linux command like
> > > WC? If custom, what does your script do?
> > >
> > > How much ram do you have? what are your Java memory settings?
> > >
> > > I used the following setup
> > >
> > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a
> > > 4 task max.
> > >
> > > I got the following results
> > >
> > > WC: 30-40% speedup
> > > Sort: 40% speedup
> > > Grep: 5X slowdown (turns out this was due to what you described
> > > above... Grep is just very highly optimized for command line)
> > > Custom perl script (which is essentially a For loop which matches each
> > > row of a dataset to a set of 100 categories): 60% speedup
> > >
> > > So I do think that it depends on your script... and some other
> > > settings of yours.
> > >
> > > Theo
> > >
> > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > programs. So far the performance has been pretty disappointing.
> > > >
> > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > capacity of each node is 2. The Hadoop version is 0.15.
> > > >
> > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> > > > standalone (on a single CPU core). Program 2 runs for 5 minutes on
> > > > the Hadoop cluster and 4.5 minutes in standalone. Both programs run
> > > > as map-only jobs.
> > > >
> > > > I understand that there is some overhead in starting up tasks and in
> > > > reading from and writing to the distributed file system. But that
> > > > does not seem to explain all the overhead. Most map tasks are
> > > > data-local. I modified program 1 to output nothing and saw the same
> > > > magnitude of overhead.
> > > >
> > > > The output of top shows that the majority of the CPU time is consumed
> > > > by Hadoop java processes (e.g.
> > > > org.apache.hadoop.mapred.TaskTracker$Child). So I added a profile
> > > > option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
> > > >
> > > > The profile results show that most of the CPU time is spent in the
> > > > following methods:
> > > >
> > > > rank  self    accum   count  trace   method
> > > > 1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
> > > > 2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
> > > > 3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
> > > > 4     16.15%  87.32%  847    300478  java.io.FileOutputStream.writeBytes
> > > >
> > > > And their stack traces show that these methods are for interacting
> > > > with the map program.
> > > >
> > > > TRACE 300472:
> > > >     java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
> > > >     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > >     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > >
> > > > TRACE 300474:
> > > >     java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
> > > >     java.io.FileInputStream.read(FileInputStream.java:199)
> > > >     java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > >     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > >     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
Re: Hadoop streaming performance problem
Does Hadoop automatically decompress the gzipped file? I only have a single input file. Does it have to be split and then gzipped? I gzipped the input file and Hadoop only created one map task. Still, Java is using more than 90% of the CPU.

On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:

> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide
> the input files gzipped. Not a great difference (e.g. 50% slower when not
> gzipped, plus it took more than twice as long to upload the data to
> HDFS-on-S3 in the first place), but still probably relevant.
>
> Andreas
>
> On Mon, 31.03.2008 at 13:30 -0700, lin wrote:
> > [...]
Re: Hadoop streaming performance problem
Try extending the Java heap size as well. I'd be interested to see what kind of effect that has on the time (if any).

On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:

> I'm running custom map programs written in C++. What the programs do is
> very simple. For example, in program 2, for each input line
>
>     ID node1 node2 ... nodeN
>
> the program outputs
>
>     node1 ID
>     node2 ID
>     ...
>     nodeN ID
>
> [...]
Re: Hadoop streaming performance problem
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a great difference (e.g. 50% slower when not gzipped, plus it took more than twice as long to upload the data to HDFS-on-S3 in the first place), but still probably relevant.

Andreas

On Mon, 31.03.2008 at 13:30 -0700, lin wrote:

> I'm running custom map programs written in C++. What the programs do is
> very simple. For example, in program 2, for each input line
>
>     ID node1 node2 ... nodeN
>
> the program outputs
>
>     node1 ID
>     node2 ID
>     ...
>     nodeN ID
>
> [...]
Re: Hadoop streaming performance problem
I'm running custom map programs written in C++. What the programs do is very simple. For example, in program 2, for each input line

    ID node1 node2 ... nodeN

the program outputs

    node1 ID
    node2 ID
    ...
    nodeN ID

Each node has 4 GB to 8 GB of memory. The Java memory setting is -Xmx300m.

I agree that it depends on the scripts. I tried replicating the computation for each input line 10 times and saw significantly better speedup. But it is still pretty bad that Hadoop streaming has such big overhead for simple programs.

I also tried writing program 1 with the Hadoop Java API. I got almost 1000% speedup on the cluster.

Lin

On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:

> Are you running a custom map script or a standard Linux command like wc?
> If custom, what does your script do?
>
> How much RAM do you have? What are your Java memory settings?
>
> [...]
Re: Hadoop streaming performance problem
Are you running a custom map script or a standard Linux command like wc? If custom, what does your script do?

How much RAM do you have? What are your Java memory settings?

I used the following setup:

2 dual-core boxes, 16 GB RAM, 1000 MB Java heap size on an empty box with a 4-task max.

I got the following results:

wc: 30-40% speedup
sort: 40% speedup
grep: 5x slowdown (turns out this was due to what you described above... grep is just very highly optimized for the command line)
Custom Perl script (essentially a for loop which matches each row of a dataset to a set of 100 categories): 60% speedup

So I do think that it depends on your script... and some other settings of yours.

Theo

On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am looking into using Hadoop streaming to parallelize some simple
> programs. So far the performance has been pretty disappointing.
>
> [...]
Hadoop streaming performance problem
Hi,

I am looking into using Hadoop streaming to parallelize some simple programs. So far the performance has been pretty disappointing.

The cluster contains 5 nodes. Each node has two CPU cores. The task capacity of each node is 2. The Hadoop version is 0.15.

Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in standalone mode (on a single CPU core). Program 2 runs for 5 minutes on the Hadoop cluster and 4.5 minutes in standalone mode. Both programs run as map-only jobs.

I understand that there is some overhead in starting up tasks and in reading from and writing to the distributed file system. But that does not seem to explain all the overhead. Most map tasks are data-local. I modified program 1 to output nothing and saw the same magnitude of overhead.

The output of top shows that the majority of the CPU time is consumed by Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child). So I added a profile option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.

The profile results show that most of the CPU time is spent in the following methods:

rank  self    accum   count  trace   method
1     23.76%  23.76%  1246   300472  java.lang.UNIXProcess.waitForProcessExit
2     23.74%  47.50%  1245   300474  java.io.FileInputStream.readBytes
3     23.67%  71.17%  1241   300479  java.io.FileInputStream.readBytes
4     16.15%  87.32%   847   300478  java.io.FileOutputStream.writeBytes

And their stack traces show that these methods are for interacting with the map program.

TRACE 300472:
    java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

TRACE 300474:
    java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
    java.io.FileInputStream.read(FileInputStream.java:199)
    java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.FilterInputStream.read(FilterInputStream.java:66)
    org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
    org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)

TRACE 300479:
    java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
    java.io.FileInputStream.read(FileInputStream.java:199)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.FilterInputStream.read(FilterInputStream.java:66)
    org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
    org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)

TRACE 300478:
    java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
    java.io.FileOutputStream.write(FileOutputStream.java:260)
    java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
    java.io.DataOutputStream.flush(DataOutputStream.java:106)
    org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
    org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
    org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

I don't understand why Hadoop streaming needs so much CPU time to read from and write to the map program. Note that it spends 23.67% of the time reading from the standard error of the map program, while the program does not output any errors at all!

Does anyone know any way to get rid of this seemingly unnecessary overhead in Hadoop streaming?

Thanks,
Lin