[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777 ]
Binglin Chang commented on MAPREDUCE-1270:
------------------------------------------

Hi, Arun. HCE 2.0 is mainly focused on stability (bug fixes) and usability.

Bug fixes: HCE is not very stable right now. Although we have fixed a lot of bugs, the current codebase is a mess :( A lot of work still needs to be done, but there is currently no time (other projects).

Usability: (bi)streaming over HCE is now released, along with PyHCE, since (bi)streaming and Python are much more popular than the Java API at Baidu. We have also added C++ versions of partitioners such as KeyFieldBasedPartitioner; Input/OutputFormats such as SequenceFile, CombineInput.., and multiple outputs; and compression codecs such as lzma, lzo, and quicklz.

As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not added yet); we gain another 10-20%, both in Hadoop and in upper-level applications.

About MR-v2: we keep watching your progress and have already read your design doc and some of the code. We look forward to further discussion on this very interesting topic.

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>
>
> Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
> 1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming. So we think a new C++ extension is needed.
> 2. Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
> 3. It costs too much to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge volumes of data.
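As an aside on the C++ partitioners mentioned in the comment above (e.g. KeyFieldBasedPartitioner): a minimal self-contained sketch of the idea, hashing one tab-separated key field to a reduce bucket. The function name and hash choice are illustrative, not HCE's actual implementation.

```cpp
#include <cstdint>
#include <string>

// Toy key-field partitioner: take the first tab-separated field of the key
// and hash it to one of numReduces buckets, so all records sharing that
// field go to the same reducer.
int partitionByFirstField(const std::string& key, int numReduces) {
    std::string field = key.substr(0, key.find('\t'));  // whole key if no tab
    uint32_t h = 2166136261u;                           // FNV-1a hash
    for (unsigned char c : field) { h ^= c; h *= 16777619u; }
    return static_cast<int>(h % static_cast<uint32_t>(numReduces));
}
```

Hashing only a prefix field (rather than the whole key) is what lets secondary fields vary within a single reducer's input, which is the point of a key-field partitioner.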
> What we want to do:
> 1. We will not use the map/reduce child JVM for any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should handle, and reads reports from the mapper until it finishes. The mapper will read records, invoke the user-defined map, do the partitioning, write spills, and combine and merge them into file.out. We think these operations can be done in C++ code.
> 2. The reducer is similar to the mapper: it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
> 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
> First 1 and 2, then 3.
>
> What's the difference from PIPES:
> 1. We will reuse most of the PIPES code.
> 2. We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
>
> *UPDATE:*
> Now you can get a test version of HCE from this link:
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all Hadoop source code.
> Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it in your cluster.
> The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives other specifications of the interface.
> The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapReduce and Pipes.
> Any comments are welcome.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
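The mapper-side pipeline the issue describes (read records, invoke the user-defined map, partition, sort, combine) can be sketched as a toy in-process C++ program. This is a hypothetical word-count illustration of those steps, not HCE's actual code; `runMapTask` and its hash choice are assumptions made for the example.

```cpp
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy in-process "map task": tokenize input records, emit each word into a
// per-reduce partition, then sort and combine each partition by counting.
// These are the map/partition/sort/combine steps the issue assigns to C++.
std::vector<std::map<std::string, int>>
runMapTask(const std::vector<std::string>& records, int numReduces) {
    std::vector<std::vector<std::string>> partitions(numReduces);
    for (const std::string& record : records) {          // read each record
        std::istringstream in(record);
        std::string word;
        while (in >> word)                               // user-defined map()
            partitions[std::hash<std::string>{}(word) % numReduces]
                .push_back(word);                        // partition
    }
    std::vector<std::map<std::string, int>> out(numReduces);
    for (int r = 0; r < numReduces; ++r)
        for (const std::string& w : partitions[r])       // sort + combine:
            ++out[r][w];                                 // std::map keeps keys ordered
    return out;
}
```

Each resulting `std::map` plays the role of one sorted, combined spill destined for a single reducer; the real system would stream spills to disk and merge them rather than hold everything in memory.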