Re: How many people is using Hadoop Streaming ?

Owen O'Malley Fri, 03 Apr 2009 10:00:37 -0700


On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

Has anyone benchmarked the performance difference of using Hadoop?
 1) Java vs C++
 2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were roughly equal and streaming was 10-20% slower. Most of the cost with streaming came from the stringification.

 1) I can pick a language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming, if it suits the problem better). In fact, I can even choose Erlang for the map() and Prolog for the reduce(). Mixing and matching lets me optimize further.
 2) I can pick the language that I am familiar with, or one that I like.
 3) It is easy to switch to another language in a fine-grained, incremental way if I choose to do so in the future.

Additionally, the interface to streaming is very stable. *smile* It also supports legacy applications well.


The downsides are that:
  1. The interface is very thin and has minimal functionality.
  2. Streaming combiners don't work very well. Many streaming
     applications buffer in the map and run the combiner internally.
  3. Streaming doesn't group the values in the reducer. In Java or C++, you get:
         key1, (value1, value2, ...)
         key2, (value3, ...)
     in streaming you get
         key1 value1
         key1 value2
         key2 value3
     and your application needs to detect the key changes.
  4. Binary data support has only recently been added to streaming.
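
To make point 3 concrete, here is a minimal sketch of how a streaming reducer might detect key changes itself. It assumes the reducer receives sorted lines of the form key<TAB>value on stdin (tab is the default streaming separator); the function names grouped and main are illustrative only and are not part of any Hadoop API.

```python
import sys
from itertools import groupby


def parse(lines, sep="\t"):
    """Yield (key, value) pairs from 'key<sep>value' lines."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def grouped(lines, sep="\t"):
    """Yield (key, [values]) by watching for key changes.

    Relies on the input being sorted by key, so that all lines
    for one key arrive consecutively.
    """
    for key, pairs in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Example reducer body: emit each key with its number of values."""
    for key, values in grouped(stdin):
        stdout.write(f"{key}\t{len(values)}\n")
```

A script used as the reducer would end with a call to main(). itertools.groupby only merges adjacent equal keys, which is exactly the key-change detection described above; it would give wrong groups on unsorted input.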

Am I missing something here? Or is the majority of Hadoop applications written in Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are streaming, 1/3 Pig, and 1/3 Java.


-- Owen
