On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:
Has anyone benchmarked the performance difference of using Hadoop?
1) Java vs C++
2) Java vs Streaming
Yes, a while ago. When I tested it using sort, Java and C++ were
roughly equal and streaming was 10-20% slower. Most of the cost with
streaming came from the stringification, that is, converting keys and
values to and from text.
1) I can pick a language that offers a different programming
paradigm (e.g. I may choose a functional language, or logic
programming, if it suits the problem better). In fact, I can even
choose Erlang for the map() and Prolog for the reduce(). Mixing and
matching lets me optimize further.
2) I can pick the language that I am familiar with, or one that I
like.
3) It is easy to switch to another language in a fine-grained,
incremental way if I choose to do so in the future.
Additionally, the interface to streaming is very stable. *smile* It
also supports legacy applications well.
The downsides are that:
1. The interface is very thin and has minimal functionality.
2. Streaming combiners don't work very well. Many streaming
applications buffer in the map and run the combiner internally.
3. Streaming doesn't group the values in the reducer. In Java or
C++, you get:
key1, (value1, value2, ...)
key2, (value3, ...)
In streaming, you get:
key1 value1
key1 value2
key2 value3
and your application needs to detect the key changes.
4. Binary data support has only recently been added to streaming.
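To make points 2 and 3 concrete, here is a rough Python sketch of what a
streaming application typically ends up doing for itself. It assumes
tab-separated "key<TAB>value" lines (the streaming default) and a
word-count style job; the function names map_with_buffer, group_by_key
and reduce_counts are made up for illustration and are not part of any
Hadoop API.

```python
from collections import Counter
from itertools import groupby


def map_with_buffer(lines):
    """Mapper that buffers counts in memory and emits one line per word,
    i.e. it does the combiner's work internally (point 2)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    for word, count in sorted(counts.items()):
        yield f"{word}\t{count}"


def group_by_key(lines):
    """Detect key changes in sorted 'key<TAB>value' lines and rebuild
    the (key, [values]) view that Java or C++ reducers get (point 3)."""
    # partition() tolerates lines with no tab: the value is then "".
    pairs = (line.rstrip("\n").partition("\t")[::2] for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def reduce_counts(lines):
    """Reducer for the word-count assumption: sum the values per key."""
    for key, values in group_by_key(lines):
        yield f"{key}\t{sum(int(v) for v in values)}"


if __name__ == "__main__":
    # A real streaming job would read sys.stdin; a small in-memory
    # sample is used here so the sketch runs on its own.
    sample = ["key1\tvalue1", "key1\tvalue2", "key2\tvalue3"]
    for key, values in group_by_key(sample):
        print(key, values)
```

The grouping only works because the framework hands the reducer its
input sorted by key, so all lines for one key are adjacent. The buffered
mapper trades memory for less shuffle traffic, so a real application
would probably flush the buffer once it grows past some size.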
Am I missing something here? Or are the majority of Hadoop
applications written in Hadoop Streaming?
On Yahoo's research clusters, typically 1/3 of the applications are
streaming, 1/3 Pig, and 1/3 Java.
-- Owen