[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Owen O'Malley (JIRA) Wed, 03 Mar 2010 23:56:51 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841114#action_12841114
 ]


Owen O'Malley commented on MAPREDUCE-1270:
------------------------------------------

{quote}
I don't think we need to be completely compatible with the pipes API
{quote}
I don't think there is enough motivation to have two different C++ APIs, so you 
should use the same interface. That does *not* mean that you can't change the 
API to be better. You can and should help make the APIs more usable and 
extensible.

{quote}
If we do need a C++ API, we should consider usability and extensibility more 
than compatibility, because I don't think such compatibility is a 
problem for most users.
{quote}
There is a requirement to provide backwards compatibility of all of Hadoop's 
public APIs with the previous version. APIs and interfaces can be deprecated 
and then removed in a later version, but compatibility is not optional.
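
As an illustration of the deprecate-then-remove path, here is a minimal sketch. 
Every name in it (the example namespace, Collector, Record, emit) is hypothetical 
and is not taken from the Pipes headers. It assumes a newer emit() overload is 
added while the old one is kept as a deprecated wrapper, so existing callers 
still compile and behave the same.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the actual Pipes API.
namespace example {

struct Record {
    std::string key;
    std::string value;
    int partition;  // -1 means "let the framework choose"
};

class Collector {
public:
    // Newer, more extensible entry point with an explicit partition hint.
    void emit(const std::string& key, const std::string& value, int partition) {
        assert(partition >= -1);  // -1 or a valid partition index
        records_.push_back(Record{key, value, partition});
    }

    // Older entry point, kept so existing callers keep working.
    // It forwards to the new overload and can be removed in a later version.
    [[deprecated("use emit(key, value, partition)")]]
    void emit(const std::string& key, const std::string& value) {
        emit(key, value, -1);
    }

    const std::vector<Record>& records() const { return records_; }

private:
    std::vector<Record> records_;
};

}  // namespace example
```

A caller of the two-argument emit() gets a compiler warning rather than a build 
break, which gives users a release to migrate before the old form is removed.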



> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>
>   Hadoop C++ extension is an internal project at Baidu. We started it for these 
> reasons:
>    1  To provide a C++ API. We mostly used Streaming before, and we also tried 
> PIPES, but we did not find PIPES to be more efficient than Streaming. So we 
> think a new C++ extension is needed for us.
>    2  Even with PIPES or Streaming, it is hard to control the memory of the Hadoop 
> map/reduce Child JVM.
>    3  It costs too much to read/write/sort TB/PB of data in Java. When using 
> PIPES or Streaming, a pipe or socket is not efficient enough to carry that much data.
>    What we want to do: 
>    1 We do not use the map/reduce Child JVM for any data processing. It just 
> prepares the environment, starts the C++ mapper, tells the mapper which split it 
> should deal with, and reads reports from the mapper until it finishes. The mapper 
> will read records, invoke the user-defined map, partition, write spills, combine, 
> and merge into file.out. We think these operations can be done in C++ code.
>    2 The reducer is similar to the mapper: it is started after the sort finishes, 
> reads from the sorted files, invokes the user-defined reduce, and writes to the 
> user-defined record writer.
>    3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
> memory control.
>    First 1 and 2, then 3.
>    What's the difference from PIPES:
>    1 Yes, we will reuse most of the PIPES code.
>    2 And we should do it more completely: nothing changes in scheduling and 
> management, but everything changes in execution.
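
The map-side flow in the description (read records, invoke the user-defined 
map, partition, then sort and merge) can be sketched roughly as below. This is 
an in-memory illustration only, not code from the proposal or from PIPES. All 
names (KV, MapFn, partitionFor, runMapper) are hypothetical, hash partitioning 
is an assumption, and the per-partition sort stands in for spill, combine and 
merge into file.out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// All names here are hypothetical; nothing below is taken from PIPES.
using KV = std::pair<std::string, std::string>;
using MapFn = std::function<void(const std::string& record, std::vector<KV>& out)>;

// Assumed partitioning scheme: hash of the key modulo the partition count.
inline std::size_t partitionFor(const std::string& key, std::size_t numPartitions) {
    return std::hash<std::string>{}(key) % numPartitions;
}

// Run the map side over one split: call the user map on each record,
// route each emitted pair to a partition, then sort every partition by key.
inline std::vector<std::vector<KV>> runMapper(const std::vector<std::string>& split,
                                              const MapFn& userMap,
                                              std::size_t numPartitions) {
    assert(numPartitions > 0);
    std::vector<std::vector<KV>> partitions(numPartitions);
    std::vector<KV> emitted;
    for (const std::string& record : split) {
        emitted.clear();
        userMap(record, emitted);
        for (KV& kv : emitted) {
            const std::size_t p = partitionFor(kv.first, numPartitions);
            partitions[p].push_back(std::move(kv));
        }
    }
    // Stable sort keeps values for equal keys in emission order.
    for (std::vector<KV>& part : partitions) {
        std::stable_sort(part.begin(), part.end(),
                         [](const KV& a, const KV& b) { return a.first < b.first; });
    }
    return partitions;
}
```

A real implementation would buffer output, spill sorted runs to disk when the 
buffer fills, and merge those runs, which is where the memory control mentioned 
in the description would come in.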

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
