[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777 ]
Binglin Chang commented on MAPREDUCE-1270:
------------------------------------------

Hi, Arun. HCE 2.0 is mainly focused on stability (bug fixes) and usability.

Bug fixes: HCE is not very stable right now. Although we have fixed a lot of bugs, the current codebase is a mess :( A lot of work still needs to be done, but there is currently no time (other projects).

Usability: (bi)streaming over HCE is now released, along with PyHCE, since (bi)streaming and Python are much more popular than the Java API at Baidu. We have also added C++ versions of partitioners such as KeyFieldBasedPartitioner; Input/OutputFormats such as SequenceFile, CombineInput.., and multiple outputs; and compression codecs such as lzma, lzo, and quicklz.

As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not added yet); we gain another 10-20%, both in Hadoop and in upper-level applications.

About MR-v2: we keep watching your progress and have already read your design doc and some of the code. We look forward to further discussion on this very interesting topic.

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>
>
> Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
> 1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming. So we think a new C++ extension is needed.
> 2. Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
> 3. It costs too much to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge volumes of data.
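As an aside on the C++ partitioners mentioned in the comment above (e.g. KeyFieldBasedPartitioner): a minimal self-contained sketch of the idea, hashing one tab-separated key field to a reduce bucket. The function name and hash choice are illustrative, not HCE's actual implementation.

```cpp
#include <cstdint>
#include <string>

// Toy key-field partitioner: take the first tab-separated field of the key
// and hash it to one of numReduces buckets, so all records sharing that
// field go to the same reducer.
int partitionByFirstField(const std::string& key, int numReduces) {
    std::string field = key.substr(0, key.find('\t'));  // whole key if no tab
    uint32_t h = 2166136261u;                           // FNV-1a hash
    for (unsigned char c : field) { h ^= c; h *= 16777619u; }
    return static_cast<int>(h % static_cast<uint32_t>(numReduces));
}
```

Hashing only a prefix field (rather than the whole key) is what lets secondary fields vary within a single reducer's input, which is the point of a key-field partitioner.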
> What we want to do:
> 1. We will not use the map/reduce child JVM for any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should handle, and reads reports from the mapper until it finishes. The mapper will read records, invoke the user-defined map, do the partitioning, write spills, and combine and merge them into file.out. We think these operations can be done in C++ code.
> 2. The reducer is similar to the mapper: it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
> 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
> First 1 and 2, then 3.
>
> What's the difference from PIPES:
> 1. We will reuse most of the PIPES code.
> 2. We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
>
> *UPDATE:*
> Now you can get a test version of HCE from this link:
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all Hadoop source code.
> Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it in your cluster.
> The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives other specifications of the interface.
> The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapReduce and Pipes.
> Any comments are welcome.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
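The mapper-side pipeline the issue describes (read records, invoke the user-defined map, partition, sort, combine) can be sketched as a toy in-process C++ program. This is a hypothetical word-count illustration of those steps, not HCE's actual code; `runMapTask` and its hash choice are assumptions made for the example.

```cpp
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy in-process "map task": tokenize input records, emit each word into a
// per-reduce partition, then sort and combine each partition by counting.
// These are the map/partition/sort/combine steps the issue assigns to C++.
std::vector<std::map<std::string, int>>
runMapTask(const std::vector<std::string>& records, int numReduces) {
    std::vector<std::vector<std::string>> partitions(numReduces);
    for (const std::string& record : records) {          // read each record
        std::istringstream in(record);
        std::string word;
        while (in >> word)                               // user-defined map()
            partitions[std::hash<std::string>{}(word) % numReduces]
                .push_back(word);                        // partition
    }
    std::vector<std::map<std::string, int>> out(numReduces);
    for (int r = 0; r < numReduces; ++r)
        for (const std::string& w : partitions[r])       // sort + combine:
            ++out[r][w];                                 // std::map keeps keys ordered
    return out;
}
```

Each resulting `std::map` plays the role of one sorted, combined spill destined for a single reducer; the real system would stream spills to disk and merge them rather than hold everything in memory.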