[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840340#action_12840340 ]
Arun C Murthy commented on MAPREDUCE-1270:
------------------------------------------

Fusheng, thinking about this a bit more, I have a suggestion to help push this through the Hadoop framework in a more straightforward manner and help get it committed: I'd propose you guys take the existing Hadoop Pipes, keep _all_ of its APIs, and implement the map-side sort, shuffle, and reduce-side merge within Pipes itself, i.e. enhance Hadoop Pipes to contain the entire 'data-path'. This way we can mark the 'C++ data-path' as experimental and let it co-exist with the current functionality, which will make it far easier to gain experience with it.

Currently Pipes allows one to implement a C++ RecordReader for the map and a C++ RecordWriter for the reduce. We can enhance Pipes to collect the map output, sort it in C++, and write out the IFile and index for the map output. The reduces would do the shuffle, merge, and 'reduce' call in C++, and use the existing infrastructure for the C++ RecordWriter to write the outputs.

A note of caution: you will need to worry about TaskCompletionEvents, i.e. the events which let the reduces know the identity and location of completed maps. Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for this information - and this might be a sticky bit. As an intermediate step, one possible way around this is to change ReduceTask.java to relay the TaskCompletionEvents from the Java Child to the C++ reducer.

In terms of development, you could start on an svn branch of Hadoop Pipes.

Thoughts?

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>        Environment: hadoop linux
>            Reporter: Wang Shouyan
>
> Hadoop C++ extension is an internal project at Baidu. We started it for these reasons:
>
> 1. To provide a C++ API.
> We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming. So we think a new C++ extension is needed for us.
>
> 2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce Child JVM.
>
> 3. It costs a great deal to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient enough to carry such huge data.
>
> What we want to do:
>
> 1. We do not use the map/reduce Child JVM to do any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should deal with, and reads reports from the mapper until it finishes. The mapper will read records, invoke the user-defined map, do the partitioning, write spills, combine, and merge into file.out. We think these operations can be done by C++ code.
>
> 2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to a user-defined record writer.
>
> 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
>
> At first, 1 and 2, then 3.
>
> What's the difference from PIPES:
>
> 1. Yes, we will reuse most of the PIPES code.
>
> 2. And we should do it more completely: nothing changed in scheduling and management, but everything in execution.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.