You're right. Java isn't really that slow. I re-examined the Java code for
the standalone program and found I was using an unbuffered output method.
After I changed it to a buffered method, the Java code's running time was
comparable to the C++ one's. This also means the 1000% speed-up I got was
quite
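For anyone who wants to reproduce the effect: the difference is just whether
every record forces its own flush. A minimal sketch (hypothetical code, not
the actual benchmark program):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    public class BufferedOut {
        public static void main(String[] args) throws IOException {
            // Slow variant: System.out is a PrintStream created with
            // autoflush on, so println() pushes every single line out:
            //   for (int i = 0; i < 1000000; i++) System.out.println("line " + i);

            // Fast variant: batch writes in a 64 KB buffer, flush once.
            BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(System.out), 1 << 16);
            for (int i = 0; i < 1000000; i++) {
                out.write("line " + i);
                out.newLine();
            }
            out.flush();
        }
    }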
LineRecordReader.readLine() is deprecated by HADOOP-2285
(http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow.
But streaming still uses the method. HADOOP-2826
(http://issues.apache.org/jira/browse/HADOOP-2826) will remove that usage in
streaming. This change should improve streaming performance.
Beg your pardon, Python is a fast language. Simple operations are usually
considerably more expensive than in lower-level languages, but at least when
used by somebody with enough experience, that doesn't matter too much.
Actually, in many practical cases, because of project deadlines, C++ (and
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a
Perl script that categorizes O(10^9) rows at a time. Also, I wrote a very
simple Python script (something like a 'cat') and saw a similar speedup.
These tests were with 1 GB files.
We were testing this here at DoubleClick
My experiences with Groovy are similar. Noticeable slowdown, but quite
bearable (almost always better than 50% of best attainable speed).
The highest virtue is that simple programs become simple again. Word count
is < 5 lines of code.
On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:
At Metaweb, we did a lot of comparisons between streaming (using Python)
and native Java, and in general streaming performance was not much
slower than the native Java -- most of the slowdown was from Python
being a slow language.
The main problems with streaming apps that we found are that th
Yes. I would have to look up the exact numbers in my Sent folder (I've
been reporting all the time to my boss), but it has been a sizeable
portion of the runtime (on an EC2+HDFS-on-S3 cluster), especially when
one included the time to load the data into HDFS.
Ah, I went ahead and looked it up:
Because many, many people do not enjoy that verbose language, you know.
(I just replaced an old 754-line task with a ported one that takes 89
lines.)
So as crazy as it might sound to some here, Hadoop streaming is the primary
interface for probably a sizeable part of the "user population". (Users
be
>
> A set of reasonable performance tests and results would be very helpful
> in letting people decide whether to go with streaming or not. Hopefully
> we can get some numbers from this thread and publish them? Has anyone else
> compared streaming with native Java?
>
I think this is a great idea. I t
Really?
I would expect the opposite: for compressed files to process slower.
You're saying that is not the case, and that compressed files actually
increase the speed of jobs?
-Colin
On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
wrote:
> Well, on our EC2/HDFS-on-S3 cluster
Travis Brady wrote:
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury-rigged solution" then that should
be made known somewhere on the wiki
We have 40 or so engineers who only use streaming at Facebook, so no
matter how jury-rigged the solution might be, it's been immensely
valuable for developer productivity. As we found with Thrift, letting
developers write code in their language of choice has many benefits,
including development speed.
It is provided because of point (2).
But that doesn't necessarily make it a good thing to do. The basic idea has
real problems.
Hive is likely to resolve many of these issues (when it becomes publicly
available), but some are inherent in the basic idea of moving data across
language barriers t
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury-rigged solution" then that should
be made known somewhere on the wiki. If it's really not
This seems a bit surprising. In my experience, well-written Java is
generally just about as fast as C++, especially for I/O-bound work. The
exceptions are:
- Java startup is still slow. This shouldn't matter much here because you
are using streaming anyway, so you have Java startup + C startup.
Well, we would like to use Hadoop streaming because our current system is in
C++ and it is easier to migrate to Hadoop streaming that way. Also, we have
very strict performance requirements, and Java seems to be too slow. I
rewrote the first program in Java and it runs 4 to 5 times slower than the
C++ one.
On M
Hadoop can't split a gzipped file, so you will only get as many maps as you
have files.
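The check behind this is essentially a one-liner in the input format. A
sketch of the logic, modeled on the old mapred-API TextInputFormat (an
illustration, not a verbatim copy of the shipped class):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitCheck {
        // True when Hadoop may split the file among several map tasks.
        // A .gz file resolves to a compression codec, so this returns
        // false and the whole file goes to exactly one map.
        static boolean isSplitable(JobConf conf, Path file) {
            CompressionCodec codec =
                    new CompressionCodecFactory(conf).getCodec(file);
            return codec == null;
        }
    }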
Why the obsession with Hadoop streaming? It is at best a jury-rigged
solution.
On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> Does Hadoop automatically decompress the gzipped file? I only have a single
> input file.
My previous java.opts was actually "--server -Xmx512m". I increased the heap
size to 1024M and the running time was about the same. The resident sizes of
the java processes seem to be no greater than 150M.
On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
wrote:
> try extending the Java heap size as well.
Does Hadoop automatically decompress the gzipped file? I only have a single
input file. Does it have to be split and then gzipped?
I gzipped the input file and Hadoop only created one map task. Still, Java
is using more than 90% CPU.
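One way out, if the answer turns out to be yes: split before uploading, so
each gzipped part becomes its own map input. A hypothetical pre-processing
sketch (file names and round-robin chunking invented for illustration):

    import java.io.*;
    import java.util.zip.GZIPOutputStream;

    public class SplitAndGzip {
        public static void main(String[] args) throws IOException {
            // args[0] = input file, args[1] = number of parts
            int parts = Integer.parseInt(args[1]);
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            Writer[] out = new Writer[parts];
            for (int i = 0; i < parts; i++) {
                out[i] = new OutputStreamWriter(new GZIPOutputStream(
                        new FileOutputStream("part-" + i + ".gz")));
            }
            String line;
            long n = 0;
            while ((line = in.readLine()) != null) {
                // distribute lines round-robin across the parts
                out[(int) (n++ % parts)].write(line + "\n");
            }
            for (Writer w : out) w.close();  // close() finishes each gzip stream
            in.close();
        }
    }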
On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
try extending the Java heap size as well. I'd be interested to see what
kind of effect that has on the running time (if any).
On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
> I'm running custom map programs written in C++. What the programs do is
> very simple. For example, in program 2,
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a great difference (e.g. 50% slower
when not gzipped, plus it took more than twice as long to upload the
data to HDFS-on-S3 in the first place), but still probably relevant.
Andreas
On Monday, the
I'm running custom map programs written in C++. What the programs do is very
simple. For example, in program 2, for each input line
ID node1 node2 ... nodeN
the program outputs
node1 ID
node2 ID
...
nodeN ID
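That transformation is only a few lines as a streaming mapper. A
hypothetical Java sketch of it (the real program is C++), using the
buffered output discussed elsewhere in this thread:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;

    public class InvertEdges {
        public static void main(String[] args) throws IOException {
            BufferedReader in =
                    new BufferedReader(new InputStreamReader(System.in));
            BufferedWriter out =
                    new BufferedWriter(new OutputStreamWriter(System.out));
            String line;
            while ((line = in.readLine()) != null) {
                // input line: "ID node1 node2 ... nodeN"
                String[] tok = line.trim().split("\\s+");
                if (tok.length < 2) continue;  // skip malformed lines
                for (int i = 1; i < tok.length; i++) {
                    // output line: "nodeI<TAB>ID"
                    out.write(tok[i]);
                    out.write('\t');
                    out.write(tok[0]);
                    out.newLine();
                }
            }
            out.flush();
        }
    }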
Each node has 4 GB to 8 GB of memory. The Java memory
Are you running a custom map script or a standard Linux command like wc? If
custom, what does your script do?
How much RAM do you have? What are your Java memory settings?
I used the following setup:
2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box with a
4-task max.
I got the following
Hi,
I am looking into using Hadoop streaming to parallelize some simple
programs. So far the performance has been pretty disappointing.
The cluster contains 5 nodes. Each node has two CPU cores. The task capacity
of each node is 2. The Hadoop version is 0.15.
Program 1 runs for 3.5 minutes on th