On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

Has anyone benchmarked the performance difference of using Hadoop?
 1) Java vs C++
 2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were roughly equal and streaming was 10-20% slower. Most of the cost with streaming came from the stringification.

 1) I can pick a language that offers a different programming paradigm (e.g. a functional or logic programming language, if it suits the problem better). In fact, I could even choose Erlang for map() and Prolog for reduce(). Mixing and matching lets me optimize further.
 2) I can pick the language that I am familiar with, or one that I like.
 3) It is easy to switch to another language in a fine-grained, incremental way if I choose to do so in the future.

Additionally, the interface to streaming is very stable. *smile* It also supports legacy applications well.

The downsides are that:
  1. The interface is very thin and has minimal functionality.
  2. Streaming combiners don't work very well. Many streaming applications buffer in the map
      and run the combiner internally.
  3. Streaming doesn't group the values in the reducer. In Java or C++, you get:
         key1, (value1, value2, ...)
         key2, (value3, ...)
      In streaming, you get:
         key1 value1
         key1 value2
         key2 value3
      and your application needs to detect the key changes.
  4. Binary data support has only recently been added to streaming.
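
To make points 2 and 3 concrete, here is a minimal Python sketch of what a streaming application typically ends up doing itself. It is only an illustration under assumptions that are not from the thread: records are tab-separated "key<TAB>value" lines, the job is a word count, and the function names (map_with_internal_combiner, group_by_key) are hypothetical helpers, not part of any Hadoop API.

```python
from collections import Counter
from itertools import groupby


def map_with_internal_combiner(lines):
    """Point 2: buffer partial counts in the mapper and emit them once,
    instead of relying on a separate streaming combiner."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    for word, count in sorted(counts.items()):
        yield f"{word}\t{count}"


def parse(lines, sep="\t"):
    """Split each 'key<sep>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def group_by_key(lines, sep="\t"):
    """Point 3: rebuild (key, [values]) groups by detecting key changes.
    This relies on the reducer input arriving sorted by key."""
    for key, pairs in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]


def reduce_counts(lines):
    """Example reduce step: sum the partial counts for each key."""
    for key, values in group_by_key(lines):
        yield f"{key}\t{sum(int(v) for v in values)}"
```

In an actual streaming job the mapper script would print each line from map_with_internal_combiner(sys.stdin) and the reducer script would print each line from reduce_counts(sys.stdin). The grouping only works because consecutive lines with the same key are adjacent in the reducer's input.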

Am I missing something here? Or are the majority of Hadoop applications written in Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are streaming, 1/3 Pig, and 1/3 Java.

-- Owen
