Re: Hadoop streaming performance problem

2008-04-03 Thread lin
You're right. Java isn't really that slow. I re-examined the Java code for the standalone program and found I was using an unbuffered output method. After I changed it to a buffered method, the Java code running time was comparable to the C++ one. This also means the 1000% speed-up I got was quite
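To illustrate why the buffering change described above can matter so much, here is a minimal sketch that counts how many writes reach the underlying stream with and without a buffer. It is written in Python rather than the Java discussed in the message, and `CountingRaw` and `count_raw_writes` are hypothetical names invented for this example; it is not the code from the thread.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw sink that discards data and counts write calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, b):
        # Each call here stands in for one trip to the operating system.
        self.calls += 1
        return len(b)


def count_raw_writes(lines, buffered):
    """Write every line and return how many writes hit the raw sink."""
    raw = CountingRaw()
    # Assumption: a 64 KiB buffer, chosen only for illustration.
    sink = io.BufferedWriter(raw, buffer_size=64 * 1024) if buffered else raw
    for line in lines:
        sink.write(line)
    sink.flush()
    return raw.calls
```

With many short lines, the unbuffered path issues one raw write per line, while the buffered path collapses them into a handful of larger writes.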

Re: Hadoop streaming performance problem

2008-03-31 Thread Amareshwari Sriramadasu
LineRecordReader.readLine() is deprecated by HADOOP-2285 (http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow. But streaming still uses the method. HADOOP-2826 (http://issues.apache.org/jira/browse/HADOOP-2826) will remove the usage in streaming. This change should improve str

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
I beg your pardon, Python is a fast language. Simple operations are usually considerably more expensive than in lower-level languages, but at least when used by somebody who has enough experience, that doesn't matter too much. Actually, in many practical cases, because of project deadlines, C++ (and

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a Perl script which categorizes O(10^9) rows at a time. Also I wrote a very simple Python script (something like a 'cat'), and saw a similar speedup. These tests were with 1 GB files. We were testing this here at DoubleClick

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
My experiences with Groovy are similar. Noticeable slowdown, but quite bearable (almost always better than 50% of best attainable speed). The highest virtue is that simple programs become simple again. Word count is < 5 lines of code.
On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> w

Re: Hadoop streaming performance problem

2008-03-31 Thread Colin Evans
At Metaweb, we did a lot of comparisons between streaming (using Python) and native Java, and in general streaming performance was not much slower than native Java -- most of the slowdown was from Python being a slow language. The main problems with streaming apps that we found are that th

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Yes. I would have to look up the exact numbers in my Sent folder (I've been reporting all the time to my boss), but it has been a sizeable portion of the runtime (on an EC2+HDFS-on-S3 cluster), especially when one included the time to load the data into HDFS. Ah, I've gone ahead and looked it up:

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Because many, many people do not enjoy that verbose language, you know. (Just replaced an old 754-line task with a ported one that takes 89 lines.) So as crazy as it might sound to some here, Hadoop streaming is the primary interface for probably a sizeable part of the "user population". (Users be

Re: Hadoop streaming performance problem

2008-03-31 Thread Travis Brady
> A set of reasonable performance tests and results would be very helpful in helping people decide whether to go with streaming or not. Hopefully we can get some numbers from this thread and publish them? Anyone else compared streaming with native java?
I think this is a great idea. I t

Re: Hadoop streaming performance problem

2008-03-31 Thread Colin Freas
Really? I would expect the opposite: for compressed files to process slower. You're saying that is not the case, and that compressed files actually increase the speed of jobs?
-Colin
On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
> Well, on our EC2/HDFS-on-S3 clus

Re: Hadoop streaming performance problem

2008-03-31 Thread Parand Darugar
Travis Brady wrote:
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that should be made known somewhere on the wik

Re: Hadoop streaming performance problem

2008-03-31 Thread Jeff Hammerbacher
We have 40 or so engineers who only use streaming at Facebook, so no matter how jury-rigged the solution might be, it's been immensely valuable for developer productivity. As we found with Thrift, letting developers write code in their language of choice has many benefits, including development sp

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
It is provided because of point (2). But that doesn't necessarily make it a good thing to do. The basic idea has real problems. Hive is likely to resolve many of these issues (when it becomes publicly available) but some are inherent in the basic idea of moving data across language barriers t

Re: Hadoop streaming performance problem

2008-03-31 Thread Travis Brady
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that should be made known somewhere on the wiki. If it's really not

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
This seems a bit surprising. In my experience well-written Java is generally just about as fast as C++, especially for I/O bound work. The exceptions are:
- Java startup is still slow. This shouldn't matter much here because you are using streaming anyway so you have Java startup + C startup.

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
Well, we would like to use Hadoop streaming because our current system is in C++ and it is easier to migrate to Hadoop streaming. Also we have very strict performance requirements. Java seems to be too slow. I rewrote the first program in Java and it runs 4 to 5 times slower than the C++ one.
On M

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
Hadoop can't split a gzipped file so you will only get as many maps as you have files. Why the obsession with Hadoop streaming? It is at best a jury rigged solution.
On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> Does Hadoop automatically decompress the gzipped file? I only have a si
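Since each gzipped file yields a single map task, one way to keep both compression and parallelism is to split the input into several gzipped files before loading it. The following is a minimal sketch under stated assumptions: every input line is an independent record, line order across files does not matter, and the function name `split_and_gzip` and the `part-NNNNN.gz` naming are hypothetical choices made for this example.

```python
import gzip


def split_and_gzip(src_path, out_prefix, num_parts):
    """Distribute the lines of src_path round-robin over num_parts gzip files.

    Returns the list of output paths. Assumes each line is an independent
    record, so spreading lines across files does not change the result.
    """
    paths = [f"{out_prefix}-{i:05d}.gz" for i in range(num_parts)]
    outputs = [gzip.open(path, "wb") for path in paths]
    try:
        with open(src_path, "rb") as src:
            for line_no, line in enumerate(src):
                # Round-robin keeps the parts roughly equal in size.
                outputs[line_no % num_parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return paths
```

Choosing `num_parts` at or above the cluster's total map capacity (10 on the 5-node, 2-tasks-per-node cluster described in this thread) gives every slot a file to work on.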

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
My previous java.opts was actually "--server -Xmx512m". I increased the heap size to 1024M and the running time was about the same. The resident sizes of the java processes seem to be no greater than 150M.
On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote:
> try extendi

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
Does Hadoop automatically decompress the gzipped file? I only have a single input file. Does it have to be split and then gzipped? I gzipped the input file and Hadoop only created one map task. Java is still using more than 90% CPU.
On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PRO

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
Try extending the Java heap size as well. I'd be interested to see what kind of an effect that has on time (if any).
On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
> I'm running custom map programs written in C++. What the programs do is very simple. For example, in progra

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a great difference (e.g. 50% slower when not gzipped, plus it took more than twice as long to upload the data to HDFS-on-S3 in the first place), but still probably relevant.
Andreas
On Monday,

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
I'm running custom map programs written in C++. What the programs do is very simple. For example, in program 2, for each input line
    ID node1 node2 ... nodeN
the program outputs
    node1 ID
    node2 ID
    ...
    nodeN ID
Each node has 4GB to 8GB of memory. The java memory
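The transformation described for program 2 can be sketched as a streaming mapper that reads lines on standard input and writes records on standard output. This is an illustrative Python version, not the original C++ program; it assumes whitespace-separated input fields and a tab between the two output fields, and the names `invert_line` and `run` are hypothetical.

```python
import sys


def invert_line(line):
    """Turn 'ID node1 ... nodeN' into ['node1<TAB>ID', ..., 'nodeN<TAB>ID']."""
    fields = line.split()
    if len(fields) < 2:
        # A blank line or a bare ID produces no output records.
        return []
    node_id, neighbors = fields[0], fields[1:]
    return [f"{node}\t{node_id}" for node in neighbors]


def run(stdin, stdout):
    """Read input lines from stdin and write the inverted pairs to stdout."""
    write = stdout.write  # text streams are buffered by default
    for line in stdin:
        for record in invert_line(line):
            write(record + "\n")


# In an actual mapper script, end the file with: run(sys.stdin, sys.stdout)
```

Keeping the per-line logic in `invert_line` makes it easy to check the mapper locally before submitting it as a streaming job.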

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
Are you running a custom map script or a standard Linux command like wc? If custom, what does your script do? How much RAM do you have? What are your Java memory settings? I used the following setup: 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box with a 4 task max. I got the follo

Hadoop streaming performance problem

2008-03-31 Thread lin
Hi, I am looking into using Hadoop streaming to parallelize some simple programs. So far the performance has been pretty disappointing. The cluster contains 5 nodes. Each node has two CPU cores. The task capacity of each node is 2. The Hadoop version is 0.15. Program 1 runs for 3.5 minutes on th