[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840166#action_12840166 ]
Fusheng Han commented on MAPREDUCE-1270:
----------------------------------------

This project is under way inside Baidu, and the basic functions are complete. HCE (Hadoop C++ Extension) runs smoothly with Text input and without any compression. We measured about a 20 percent improvement over Streaming, using 40 GB of input on 5 nodes with a WordCount MapReduce application.

The interfaces exposed to users are similar to those of PIPES. The Mapper interface is:

  class Mapper {
  public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t map(MapInput &input) = 0;
  protected:
    virtual void emit(const void* key, const int64_t keyLength,
                      const void* value, const int64_t valueLength) {
      getContext()->emit(key, keyLength, value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
  };

Modeled after the new Hadoop MapReduce interface, setup() and cleanup() functions are added here. MapInput is a newly defined type for map input; the key and value can be retrieved from this object. An emit() function is provided that can be invoked directly from map(). Keys and values are passed as raw memory pointers with corresponding lengths, which is better suited to non-text data.

The Reducer mirrors the Mapper:

  class Reducer {
  public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t reduce(ReduceInput &input) = 0;
  protected:
    virtual void emit(const void* key, const int64_t keyLength,
                      const void* value, const int64_t valueLength) {
      getContext()->emit(key, keyLength, value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
  };

One slight difference is that ReduceInput can step through its values with a next() function. In Hadoop MapReduce, the Combiner interface is no different from the Reducer's; here we make a small change so that a Combiner can only emit a value (there is no key parameter in its emit() function).
The key is omitted from the combine function's emit pair because a mistaken key could corrupt the order of the map output; the output key of emit() is determined by the input.

  class Combiner {
  public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t combine(ReduceInput &input) = 0;
  protected:
    virtual void emit(const void* value, const int64_t valueLength) {
      getContext()->emit(getCombineKey(), getCombineKeyLength(), value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
    virtual const void* getCombineKey() { return combineKey; }
    virtual int64_t getCombineKeyLength() { return combineKeyLength; }
  };

The Partitioner also gets setup() and cleanup() functions:

  class Partitioner {
  public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup() { return 0; }
    virtual int partition(const void* key, const int64_t keyLength, int numOfReduces) = 0;
  };

Following PIPES, we add a new entry named "hce" to the hadoop command. Users run a command like "hadoop hce XXX" to invoke an HCE MapReduce job. We'd like to hear your comments.

> Hadoop C++ Extension
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>
> Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
> 1 To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES more efficient than Streaming, so we think a new C++ extension is needed.
> 2 Even with PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
> 3 It costs too much to read/write/sort TB/PB-scale data in Java, and with PIPES or Streaming, a pipe or socket is not efficient for carrying such huge data.
> What we want to do:
> 1 We do not use the map/reduce child JVM for any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should handle, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, does the partitioning, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
> 2 The reducer is similar to the mapper: it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
> 3 We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
> First 1 and 2, then 3.
> What's the difference from PIPES:
> 1 Yes, we will reuse most of the PIPES code.
> 2 And we should do it more completely: nothing changes in scheduling and management, but everything changes in execution.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.