[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405688#comment-13405688 ]

Dong Yang commented on MAPREDUCE-1270:
--------------------------------------

Hi Mikhail, Yihang,

I am sorry that I cannot post the most recent / stable version of HCE for download; some limitations prevent it. We have redirected HCE to MAPREDUCE-2841 (Task level native optimization), which is a new implementation based on HCE and provides a larger performance improvement. We will keep contributing to MAPREDUCE-2841, so please watch that issue.

Thanks,
Dong

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>
>
> Hadoop C++ Extension (HCE) is an internal project at Baidu. We started it for these reasons:
>    1 To provide a C++ API. We have mostly used Streaming, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
>    2 Even with PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce Child JVM.
>    3 Reading, writing and sorting TB/PB of data in Java is costly. With PIPES or Streaming, a pipe or socket is not efficient enough to carry that much data.
> What we want to do:
>    1 The map/reduce Child JVM does no data processing. It only prepares the environment, starts the C++ mapper, tells the mapper which split to process, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, partitions, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
>    2 The reducer is similar to the mapper. It starts after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
>    3 We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
>    Items 1 and 2 come first, then 3.
> What's the difference from PIPES:
>    1 We will reuse most of the PIPES code.
>    2 We will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
> *UPDATE:*
> A test version of HCE is available from this link:
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all Hadoop source code.
> Follow the attached document "HCE InstallMenu.pdf" to build and deploy it on your cluster.
> The attachment "HCE Tutorial.pdf" walks you through writing a first HCE program and gives other specifications of the interface.
> The attachment "HCE Performance Report.pdf" compares the performance of HCE with Java MapRed and Pipes.
> Any comments are welcome.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
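The map-side flow in the issue description (read records, invoke the user-defined map, partition, combine, merge) can be sketched in C++ roughly as below. This is only an illustration under stated assumptions, not HCE code: the names `Record`, `MapFn`, `CombineFn`, `partitionOf`, `runMapTask` and `wordCountMap` are hypothetical, `std::hash` stands in for whatever partitioner the framework uses, and an in-memory `std::map` stands in for the sort, spill and merge into file.out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not the HCE interfaces.
using Record = std::pair<std::string, long>;
using MapFn = std::function<void(const std::string&, std::vector<Record>&)>;
using CombineFn = std::function<long(long, long)>;

// Choose a partition (one per reducer) for a key.
std::size_t partitionOf(const std::string& key, std::size_t numPartitions) {
    assert(numPartitions > 0);
    return std::hash<std::string>{}(key) % numPartitions;
}

// Run the map-side steps in native code: invoke the user map on every input
// record, route each emitted pair to a partition, and combine values per key.
// std::map keeps keys ordered, standing in for the sort/spill/merge steps.
std::vector<std::map<std::string, long>> runMapTask(
        const std::vector<std::string>& inputRecords,
        std::size_t numPartitions,
        const MapFn& userMap,
        const CombineFn& combine) {
    std::vector<std::map<std::string, long>> partitions(numPartitions);
    std::vector<Record> emitted;
    for (const std::string& record : inputRecords) {
        emitted.clear();
        userMap(record, emitted);
        for (const Record& kv : emitted) {
            auto& bucket = partitions[partitionOf(kv.first, numPartitions)];
            auto it = bucket.find(kv.first);
            if (it == bucket.end()) {
                bucket.emplace(kv.first, kv.second);
            } else {
                it->second = combine(it->second, kv.second);
            }
        }
    }
    return partitions;
}

// Example user map: emit (word, 1) for every space-separated word.
void wordCountMap(const std::string& line, std::vector<Record>& out) {
    std::size_t start = 0;
    while (start < line.size()) {
        std::size_t end = line.find(' ', start);
        if (end == std::string::npos) end = line.size();
        if (end > start) out.emplace_back(line.substr(start, end - start), 1L);
        start = end + 1;
    }
}
```

With a single partition and an additive combiner, calling `runMapTask({"a b a", "b a"}, 1, wordCountMap, add)` yields one partition holding `a -> 3` and `b -> 2`. A real implementation would spill sorted runs to disk and merge them rather than hold everything in memory.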
[jira] [Created] (MAPREDUCE-2446) HCE 2.0
HCE 2.0
-------

                 Key: MAPREDUCE-2446
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2446
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/streaming, pipes, task
            Reporter: Dong Yang

Enhancing MapReduce by task-level optimization. Besides yielding speedups of up to 130% over the original Streaming program, HCE 2.0 provides more flexible programming interfaces, including C++, Java, Python, etc.
[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Yang updated MAPREDUCE-1270:
---------------------------------

    Attachment: HADOOP-HCE-1.0.0.patch

HCE-1.0.0.patch for mapreduce trunk (revision 963075)
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891544#action_12891544 ]

Dong Yang commented on MAPREDUCE-1270:
--------------------------------------

Here is HADOOP-HCE-1.0.0.patch for mapreduce trunk (revision 963075), which contains the Hadoop C++ Extension (HCE for short) changes against mapreduce-963075.

The steps for using this patch are as follows:
1. Download HADOOP-HCE-1.0.0.patch
2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
3. cd trunk-963075
4. patch -p0 < HADOOP-HCE-1.0.0.patch
5. sh build.sh (needs java, forrest and ant)

HCE includes Java and C++ code that depends on libhdfs, so build.sh first checks out hdfs trunk and builds it.
[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Yang updated MAPREDUCE-1270:
---------------------------------

    Attachment: Overall Design of Hadoop C++ Extension.doc

Hadoop C++ Extension (HCE for short) is a framework for making mapreduce more stable and faster. Here is the overall design of HCE; comments on its practical implementation are welcome.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786952#action_12786952 ]

Dong Yang commented on MAPREDUCE-1270:
--------------------------------------

1. The Child JVM process is retained. It sets up the runtime environment, starts the C++ process, and is in charge of communicating with Hadoop; it contains no data read/write logic.
2. The Child JVM process communicates with the C++ process via stdin, stdout or stderr.
3. The C++ process only accepts commands, processes data, and reports status; it is not concerned with scheduling or exception handling.
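The split of responsibilities in the three points above (the Child JVM sends commands, the C++ process does the work and reports status) could look roughly like the loop below. This is a minimal sketch under assumptions: the line protocol (`RUN_MAP <split-id>`, `CLOSE`, `STATUS ...` replies) and the function name `serveCommands` are invented for illustration and are not taken from HCE. Streams are passed in so the same loop can be driven by `std::cin` and `std::cerr` in a real child process.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical line protocol, invented for this sketch:
//   RUN_MAP <split-id>   process one input split
//   CLOSE                finish and leave the loop
// Replies are written as "STATUS ..." lines on the report stream.
// Returns the number of splits processed.
int serveCommands(std::istream& commands, std::ostream& report) {
    int splitsProcessed = 0;
    std::string line;
    while (std::getline(commands, line)) {
        std::istringstream fields(line);
        std::string verb;
        fields >> verb;
        if (verb == "RUN_MAP") {
            std::string splitId;
            if (!(fields >> splitId)) {
                report << "STATUS error missing-split\n";
                continue;
            }
            // Real data processing (read, map, spill, merge) would go here.
            ++splitsProcessed;
            report << "STATUS done " << splitId << '\n';
        } else if (verb == "CLOSE") {
            report << "STATUS closed\n";
            break;
        } else if (!verb.empty()) {
            // Unknown commands are reported, not acted on; scheduling and
            // error recovery stay with the parent process.
            report << "STATUS error unknown-command\n";
        }
    }
    report.flush();
    return splitsProcessed;
}
```

In a child process this would be called as `serveCommands(std::cin, std::cerr)` (with `<iostream>` included), leaving stdout free for other output. Which stream carries which kind of message is a design choice the comment above does not specify.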