[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840340#action_12840340 ]

Arun C Murthy commented on MAPREDUCE-1270:
------------------------------------------

Fusheng, thinking about this a bit more, I have a suggestion that could push 
this through the Hadoop framework in a more straightforward manner and help 
get it committed:

I'd propose you guys take the existing Hadoop Pipes, keep _all_ of its APIs, 
and implement the map-side sort, shuffle, and reduce-side merge within Pipes 
itself, i.e. enhance Hadoop Pipes to own the entire 'data-path'. This way we 
can mark the 'C++ data-path' as experimental and let it co-exist with the 
current functionality, which will make it far easier to gain experience with 
it.
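To make the co-existence idea concrete, here is a minimal sketch of what "keep all of the APIs" could look like: the user-facing Mapper interface stays as it is in Pipes today, and the only new surface is an experimental switch selecting who owns the data-path. All names here (the classes, the flag values) are illustrative, not the actual Pipes headers or a real Hadoop property.

```cpp
#include <string>

// Hypothetical sketch: the user-facing interfaces stay exactly as in
// today's Hadoop Pipes; only the machinery behind them changes.
class MapContext {
public:
  virtual void emit(const std::string& key, const std::string& value) = 0;
  virtual ~MapContext() {}
};

class Mapper {
public:
  virtual void map(MapContext& ctx) = 0;  // unchanged user API
  virtual ~Mapper() {}
};

// The only visible addition: an experimental setting choosing who owns
// sort/shuffle/merge (an invented name, not a real Hadoop property).
enum class DataPath { JavaChild, CppPipes };

DataPath parseDataPath(const std::string& conf) {
  // Anything other than the explicit opt-in falls back to the current
  // Java data-path, so existing jobs are unaffected.
  return conf == "cpp" ? DataPath::CppPipes : DataPath::JavaChild;
}
```

Defaulting to the Java path means the experimental C++ path is strictly opt-in, which is what lets the two co-exist safely.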

Currently Pipes allows one to implement a C++ RecordReader for the map and a 
C++ RecordWriter for the reduce. We can enhance Pipes to collect the 
map-output, sort it in C++, and write out the IFile and index for the 
map-output. The reduces would do the shuffle, merge, and 'reduce' call in C++ 
and use the existing C++ RecordWriter infrastructure to write the outputs.
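A rough sketch of the map-side piece, under stated assumptions: the collector below buffers (key, value) pairs per partition, sorts each partition by key, and records where each partition starts, standing in for the index that the real format keeps in file.out.index. The real IFile additionally length-prefixes records, checksums, and optionally compresses; all of that is elided, and every name here is hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical C++ map-output collector: one bucket per reduce partition.
class MapOutputCollector {
public:
  explicit MapOutputCollector(int numPartitions) : buckets_(numPartitions) {}

  // Route each record to its partition, as the Java MapOutputBuffer does.
  void collect(const std::string& key, const std::string& value) {
    std::size_t p = std::hash<std::string>()(key) % buckets_.size();
    buckets_[p].emplace_back(key, value);
  }

  // Sort each partition by key and return the starting record offset of
  // every partition -- a stand-in for writing file.out plus file.out.index.
  std::vector<std::size_t> sortAndIndex() {
    std::vector<std::size_t> index;
    std::size_t offset = 0;
    for (auto& bucket : buckets_) {
      std::sort(bucket.begin(), bucket.end());
      index.push_back(offset);
      offset += bucket.size();
    }
    return index;
  }

  const std::vector<std::pair<std::string, std::string>>&
  partition(int p) const {
    return buckets_[p];
  }

private:
  std::vector<std::vector<std::pair<std::string, std::string>>> buckets_;
};
```

The index is what lets a reduce fetch only its own partition from the map output, so it has to be written alongside the sorted data just as the Java side does today.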

A note of caution: you will need to worry about TaskCompletionEvents, i.e. 
the events which let the reduces know the identity and location of completed 
maps. Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol 
for this information, and this might be a sticky bit. As an intermediate 
step, one possible workaround is to change ReduceTask.java to relay the 
TaskCompletionEvents from the Java Child to the C++ reducer.
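For the intermediate step, the Java Child would keep polling the umbilical as it does now and forward each completion event down to the C++ side; the C++ shuffle then only needs to parse them. A sketch of the receiving end, assuming an invented one-line-per-event wire format (this format is purely for illustration, it is not part of Pipes):

```cpp
#include <sstream>
#include <string>

// Hypothetical relay: ReduceTask.java could forward each
// TaskCompletionEvent to the C++ reducer as a whitespace-separated line,
// e.g. "attempt_200912_0001_m_000003_0 tracker-host 8080".
// The C++ shuffle then knows which finished map to fetch, and from where.
struct CompletedMap {
  std::string attemptId;  // identity of the finished map attempt
  std::string host;       // TaskTracker serving that map's output
  int port;               // port of the map-output server on that host
};

// Returns false on a malformed line so the shuffle can skip or re-request it.
bool parseCompletionEvent(const std::string& line, CompletedMap* out) {
  std::istringstream in(line);
  return static_cast<bool>(in >> out->attemptId >> out->host >> out->port);
}
```

Keeping the relay inside ReduceTask.java means TaskUmbilicalProtocol itself is untouched, which is what makes this viable as an intermediate step.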

In terms of development, you could start on an svn branch of Hadoop Pipes.

Thoughts?

> Hadoop C++ Extension
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>
>   Hadoop C++ extension is an internal project at Baidu. We started it for 
> these reasons:
>    1  To provide a C++ API. We mostly used Streaming before, and we also 
> tried PIPES, but we did not find PIPES to be more efficient than Streaming, 
> so we think a new C++ extension is needed.
>    2  Even using PIPES or Streaming, it is hard to control the memory of 
> the hadoop map/reduce Child JVM.
>    3  It costs so much to read/write/sort TB/PB of data in Java, and when 
> using PIPES or Streaming, a pipe or socket is not efficient enough to carry 
> such huge data.
>    What we want to do: 
>    1 We do not use the map/reduce Child JVM for any data processing; it 
> just prepares the environment, starts the C++ mapper, tells the mapper 
> which split it should deal with, and reads reports from the mapper until it 
> finishes. The mapper reads records, invokes the user-defined map, does the 
> partitioning, writes spills, combines, and merges into file.out. We think 
> these operations can be done in C++ code.
>    2 The reducer is similar to the mapper; it is started after the sort 
> finishes, reads from the sorted files, invokes the user-defined reduce, and 
> writes to the user-defined record writer.
>    3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
> memory control.
>    At first, 1 and 2; then 3.  
>    What's the difference from PIPES:
>    1 Yes, we will reuse most of the PIPES code.
>    2 But we should do it more completely: nothing changes in scheduling and 
> management, but everything does in execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
