[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840666#action_12840666 ]
Fusheng Han commented on MAPREDUCE-1270:
----------------------------------------

Arun, I appreciate your comments. The bad news is that our design document is written in Chinese. My team members and I will post the design details step by step over the next few days.

For Q3, we do indeed change the interface of the Combiner, but its semantics are the same as in Java Map-Reduce; the change prevents mistaken use of the Combiner. Consider the situation where two spills of sorted records are merged into file.out (the output of the map phase). The data flow is:

  -> the two spills are read and merged
  -> the Combiner receives sorted <key, value> pairs
  -> after processing, the Combiner emits output <key, value> pairs
  -> the output is written directly into file.out

If the Combiner emitted unrelated keys, the records in file.out would no longer be fully sorted. In our interface the Combiner is not allowed to emit keys; the output key is determined by the input. The ordering of records in file.out is therefore guaranteed.

to be continued... :)

> Hadoop C++ Extension
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>
> Hadoop C++ extension is an internal project at Baidu. We started it for these reasons:
>   1 To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
>   2 Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce Child JVM.
>   3 It costs too much to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient enough to carry such huge data.
> What we want to do:
>   1 We do not use the map/reduce Child JVM to do any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should deal with, and reads reports from the mapper until it has finished. The mapper will read records, invoke the user-defined map, do the partition, write spills, combine, and merge into file.out. We think these operations can be done in C++ code.
>   2 The reducer is similar to the mapper: it is started after the sort has finished, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
>   3 We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
> At first, 1 and 2, then 3.
> What's the difference from PIPES:
>   1 Yes, we will reuse most of the PIPES code.
>   2 And we will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
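To make the Combiner restriction described in the comment above concrete, here is a minimal C++ sketch of a combiner API in which the framework binds the output key to the current input key, so user code can only emit values. All class and method names below are illustrative assumptions, not the actual Baidu or MAPREDUCE-1270 interface.

#include <string>
#include <vector>

// Hypothetical sketch only: names and signatures are assumptions for
// illustration, not the real HCE interface.
//
// The framework constructs a ValueEmitter already bound to the key currently
// being combined.  User code can append values but cannot choose a different
// output key, so the merged file.out stays sorted by key.
class ValueEmitter {
public:
    explicit ValueEmitter(const std::string& key) : key_(key) {}

    // Writes <key_, value> to the stream that feeds file.out.  In a real
    // implementation this would append to the spill/merge output; here we
    // just collect the pairs for the example.
    void emit(const std::string& value) {
        output_.push_back(key_ + "\t" + value);
    }

    const std::vector<std::string>& output() const { return output_; }

private:
    std::string key_;                  // fixed by the framework, not by user code
    std::vector<std::string> output_;
};

// User-defined combiner: receives all values of one key (already grouped by
// the merge of sorted spills) and may only emit values through the bound emitter.
class Combiner {
public:
    virtual ~Combiner() = default;
    virtual void combine(const std::string& key,
                         const std::vector<std::string>& values,
                         ValueEmitter* emitter) = 0;
};

// Example: a word-count style combiner that sums the counts of one key.
class SumCombiner : public Combiner {
public:
    void combine(const std::string& /*key*/,
                 const std::vector<std::string>& values,
                 ValueEmitter* emitter) override {
        long long sum = 0;
        for (const std::string& v : values) {
            sum += std::stoll(v);
        }
        emitter->emit(std::to_string(sum));  // key is implicit and cannot be changed
    }
};

The only point of the sketch is that the emitter is created by the framework with the current key already fixed, so a combiner written against such an interface has no way to emit an unrelated key and therefore cannot break the sort order of file.out.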