I agree... as I said in one of the earlier emails, I saw a 50% speedup in a Perl script which categorizes O(10^9) rows at a time. I also wrote a very simple Python script (something like a 'cat'), and saw a similar speedup. These tests were with 1 Gig files.
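For anyone following along, here is a minimal sketch of what a 'cat'-like streaming script might look like. It is not the actual script from these tests; it only assumes the usual Hadoop streaming convention that a mapper reads records from stdin and writes them to stdout. The function name `identity_mapper` is hypothetical.

```python
import sys


def identity_mapper(lines, out):
    """Copy each input record to the output unchanged, like `cat`."""
    for line in lines:
        out.write(line)


if __name__ == "__main__":
    # Hadoop streaming feeds input records on stdin and collects stdout.
    identity_mapper(sys.stdin, sys.stdout)
```

A script like this would be handed to the streaming jar through its -mapper option. Because it does no work per record, it mostly measures the overhead of piping data through an external process.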
We were testing this here at DoubleClick (though it's kind of pointless now given that we have access to Google's MapReduce cluster :-) , and we regularly process 25-100 Gig datasets... the best part of which is that we don't have to rewrite much of our Perl, R or bash code.

On Mon, Mar 31, 2008 at 7:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> My experiences with Groovy are similar. Noticeable slowdown, but quite
> bearable (almost always better than 50% of best attainable speed).
>
> The highest virtue is that simple programs become simple again. Word count
> is < 5 lines of code.
>
>
> On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:
>
> > At Metaweb, we did a lot of comparisons between streaming (using Python)
> > and native Java, and in general streaming performance was not much
> > slower than the native java -- most of the slowdown was from Python
> > being a slow language.
> >
> > The main problems with streaming apps that we found are that they are
> > hard to write and there are many ways that you can make simple mistakes
> > in streaming that slow down performance.
> >
> > We've been experimenting with embedding JavaScript (Rhino) and Jython
> > for writing jobs, and have found that performance is good and the apps
> > are much easier to write. The tight Java integration means that
> > performance bottlenecks get rewritten in Java with little sacrifice to
> > development speed. One of these days we'll open source these frameworks.
> >
> >
> > Parand Darugar wrote:
> >> Travis Brady wrote:
> >>> This brings up two interesting issues:
> >>>
> >>> 1. Hadoop streaming is a potentially very powerful tool, especially for
> >>> those of us who don't work in Java for whatever reason
> >>> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> >>> be made known somewhere on the wiki. If it's really not supposed to be
> >>> used, why is it provided at all?
> >>>
> >>
> >> A set of reasonable performance tests and results would be very
> >> helpful in helping people decide whether to go with streaming or not.
> >> Hopefully we can get some numbers from this thread and publish them?
> >> Anyone else compared streaming with native java?
> >>
> >> Best,
> >>
> >> Parand
> >
>

--
Theodore Van Rooy
http://greentheo.scroggles.com
