[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405688#comment-13405688 ] Dong Yang commented on MAPREDUCE-1270:
--
Hi Mikhail, Yihang,
I am sorry that I cannot post the most recent / stable version of HCE for download; some limitations prevent it. We have now redirected HCE to MAPREDUCE-2841 (Task level native optimization), which is a new implementation based on HCE and provides a higher performance improvement. We will continue to contribute to MAPREDUCE-2841, so please watch that JIRA.
Thanks, Dong

> Hadoop C++ Extention
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment: hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>
> Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
> 1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming. So we think a new C++ extension is needed.
> 2. Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
> 3. It costs a great deal to read/write/sort TB/PB-scale data in Java. When using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge data volumes.
> What we want to do:
> 1. We do not use the map/reduce child JVM for any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should handle, and reads reports from the mapper until it finishes. The mapper will read records, invoke the user-defined map, do the partitioning, write spills, combine, and merge into file.out. We think these operations can be done by C++ code.
> 2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
> 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
> At first, 1 and 2, then 3.
> What's the difference from PIPES:
> 1. Yes, we will reuse most of the PIPES code.
> 2. And we should do it more completely: nothing changes in scheduling and management, but everything changes in execution.
> *UPDATE:*
> Now you can get a test version of HCE from this link:
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all Hadoop source code.
> Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it in your cluster.
> The attachment "HCE Tutorial.pdf" leads you through writing your first HCE program and gives other specifications of the interface.
> The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapRed and Pipes.
> Any comments are welcome.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404883#comment-13404883 ] Mikhail Bautin commented on MAPREDUCE-1270:
--
Hello HCE Developers,
Would it be possible to post the most recent / stable version of HCE for download? It would be even better if you could continuously push your HCE code changes to, e.g., a GitHub repository.
Thanks, Mikhail
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123646#comment-13123646 ] linyihang commented on MAPREDUCE-1270:
--
Hello Mr. Yang,
I am new to HCE. I downloaded HCE 1.0 and tried to build it with "./build.sh", but it failed with these errors:
"../hadoop_hce_v1/hadoop-0.20.3/../java6/jre/bin/java: 1: Syntax error: "(" unexpected
../hadoop_hce_v1/hadoop-0.20.3/../java6/jre/bin/java: 1: Syntax error: "(" unexpected"
I then modified build.sh by commenting out the lines below, since I already have jdk1.6.0_21 and ant 1.8.2 installed:
"# prepare the java and ant ENV
#export JAVA_HOME=${workdir}/../java6
#export ANT_HOME=${workdir}/../ant
#export PATH=${JAVA_HOME}/bin:${ANT_HOME}/bin:$PATH"
But then there are errors like:
"[exec] /usr/include/linux/tcp.h:77: error: '__u32 __fswab32(__u32)' cannot appear in a constant-expression"
and the build never reaches the "BUILD SUCCESSFUL" message that InstallMenu.pdf shows. My OS is Ubuntu 10.10 and GCC is version 4.4.5. Is there any mistake I have made?
Furthermore, I then tried to fix the error, as someone on Google suggested, by replacing one "#include" with another in "../hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Common/Type.hh", and also adding an "#include" to Type.hh. But then a lot of what I think are mistakes appear, such as "printf("%lld")" being mistyped as "printf("lld")", and one serious error, shown below, which worries me a lot.
The serious error is:
"[exec] then mv -f ".deps/CompressionFactory.Tpo" ".deps/CompressionFactory.Po"; else rm -f ".deps/CompressionFactory.Tpo"; exit 1; fi
[exec] In file included from /usr/include/limits.h:153,
[exec]                  from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/limits.h:122,
[exec]                  from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/syslimits.h:7,
[exec]                  from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/limits.h:11,
[exec]                  from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/../../../../nativelib/lzo/lzo/lzoconf.h:52,
[exec]                  from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/../../../../nativelib/lzo/lzo/lzo1.h:45,
[exec]                  from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/LzoCompressor.hh:23,
[exec]                  from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/LzoCodec.hh:27,
[exec]                  from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/CompressionFactory.cc:23:
[exec] /usr/include/bits/xopen_lim.h:95: error: missing binary operator before token "("
[exec] /usr/include/bits/xopen_lim.h:98: error: missing binary operator before token "("
[exec] /usr/include/bits/xopen_lim.h:122: error: missing binary operator before token "("
[exec] make[1]: Leaving directory `/home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/build/c++-build/Linux-i386-32/hce/impl/Compress'
[exec] make[1]: *** [CompressionFactory.o] Error 1
[exec] make: *** [install-recursive] Error 1"
How can I fix this error?
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777 ] Binglin Chang commented on MAPREDUCE-1270:
--
Hi Arun,
HCE 2.0 is mainly focused on stability (bug fixes) and usability.
Bug fixes: HCE is not very stable right now. Although we have fixed a lot of bugs, the current codebase is a mess :( A lot of work needs to be done, but we currently have no time (other projects).
Usability: (bi)streaming over HCE is now released, as is PyHCE, since (bi)streaming and Python are much more popular than the Java API at Baidu; also C++ versions of partitioners such as KeyFieldBasedPartitioner; Input/OutputFormats such as SequenceFile, CombineInput, and multiple outputs; and compression codecs such as lzma, lzo, and quicklz.
As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not added yet); we gain another 10-20%, both in Hadoop and in upper-level applications.
About MR-v2: we keep watching your progress and have already read your design doc and some of the code; we look forward to further discussion on this very interesting topic.
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084768#comment-13084768 ] Arun C Murthy commented on MAPREDUCE-1270:
--
With MAPREDUCE-279, we can now support alternate runtimes for MapReduce. Do you want to take a look and see if we can integrate more closely? The Java layer might be completely unnecessary now...
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084767#comment-13084767 ] Arun C Murthy commented on MAPREDUCE-1270:
--
Can someone please help me understand the relationship between this JIRA and MAPREDUCE-2446?
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040800#comment-13040800 ] Binglin Chang commented on MAPREDUCE-1270:
--
Koth,
In HCE, sockets are only used for passing control messages (unlike C++ Pipes), which has little impact on performance. For data processing, such as input/map/intermediate-output/reduce/output, everything is implemented in C++, so JNI is not needed, except for reading input from HDFS and writing output to HDFS, where HCE uses libhdfs, which is JNI based.
I think a JNI-based C++ extension for MR has the advantage of being non-intrusive and has better compatibility. In the current HCE design, we need to reimplement many features that already exist in Java; some of these bring performance benefits (sort, spill), while others are purely duplicate work. In the current HCE design, if you want the performance benefits of HCE, the only way is to use the HCE interface. My thought is to extract the high-performance parts (sort, spill, and compression in MapOutputCollector), wrap them using JNI as a native library, like the compression codecs, with a jobconf item to enable/disable the native optimization. That way the code stays compatible, and Java-based jobs can also get the performance benefits.
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028750#comment-13028750 ] eric baldeschwieler commented on MAPREDUCE-1270:
--
Hi Folks,
I'm back part-time, but I'm mainly focused on catching up and adjusting to life with a newborn at home. Peter Cnudde is currently heading up Hadoop service delivery. Most line issues can continue to go to Amol, Satish, Avik, or Senthil as appropriate. I am around; drop me a line on my personal email or call my cell if you need a rapid response, but I am reading mail now.
Thanks, E14
[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028748#comment-13028748 ] koth chen commented on MAPREDUCE-1270:
--
I don't think pipes-based map/reduce tasks will perform better than JNI-based ones! Why do you think socket communication will be better than JNI method calls? I've written a JNI-based framework for C++ map/reduce tasks and ported HBase's HFile to my framework as the input/output format. It works great!
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891662#action_12891662 ] Doug Cutting commented on MAPREDUCE-1270:
--
Looks like BSD: http://www.boost.org/LICENSE_1_0.txt
So we'd just need to append it to LICENSE.txt, noting there which files are under this license.
>at first, 1 and 2, then 3. >What's the difference with PIPES: >1 Yes, We will reuse most PIPES code. >2 And, We should do it more completely, nothing changed in scheduling and > management, but everything in execution. > *UPDATE:* > Now you can get a test version of HCE from this link > http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1 > This is a full package with all hadoop source code. > Following document "HCE InstallMenu.pdf" in attachment, you will build and > deploy it in your cluster. > Attachment "HCE Tutorial.pdf" will lead you to write the first HCE program > and give other specifications of the interface. > Attachment "HCE Performance Report.pdf" gives a performance report of HCE > compared to Java MapRed and Pipes. > Any comments are welcomed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
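The map-side data path described in point 1 of the issue (read records, invoke the user map, partition, sort, spill/merge) can be sketched roughly as below. This is an illustrative in-memory sketch, not the HCE API: the names `KV`, `partitionOf`, and `collectAndSort` are assumptions, and the real implementation would spill sorted runs to disk and merge them into file.out.

```cpp
// Illustrative sketch of the map-side pipeline: partition each map output
// pair, then sort each partition by key. Real HCE/Pipes code would write
// sorted spills to disk and merge them; this keeps everything in memory.
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;

// Partition by a hash of the key, as the Java framework does by default.
int partitionOf(const std::string& key, int numPartitions) {
    return static_cast<int>(std::hash<std::string>{}(key) % numPartitions);
}

// Collect map outputs per partition, then sort each partition by key,
// standing in for the spill/merge steps of the description above.
std::vector<std::vector<KV>> collectAndSort(const std::vector<KV>& mapOutput,
                                            int numPartitions) {
    std::vector<std::vector<KV>> partitions(numPartitions);
    for (const KV& kv : mapOutput)
        partitions[partitionOf(kv.first, numPartitions)].push_back(kv);
    for (auto& p : partitions)
        std::sort(p.begin(), p.end());
    return partitions;
}
```

Each reducer would then fetch exactly one of these sorted partitions, which is what makes the later merge cheap.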
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891596#action_12891596 ] Allen Wittenauer commented on MAPREDUCE-1270: - This patch appears to contain code from the C++ Boost library. Someone needs to do the legwork to determine the legality of the patch.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891544#action_12891544 ] Dong Yang commented on MAPREDUCE-1270: -- Here is HADOOP-HCE-1.0.0.patch for the mapreduce trunk (revision 963075), which includes the Hadoop C++ Extension (HCE for short) changes against mapreduce-963075. The steps for using this patch are as follows:
1. Download HADOOP-HCE-1.0.0.patch
2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
3. cd trunk-963075
4. patch -p0 < HADOOP-HCE-1.0.0.patch
5. sh build.sh (needs java, forrest, and ant)
HCE includes Java and C++ code and depends on libhdfs, so build.sh first checks out the hdfs trunk and builds it.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879707#action_12879707 ] Wang Shouyan commented on MAPREDUCE-1270: - Posting the entire tarball is just for trial; we will deploy it in our production environment first, and later provide a patch for trunk.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879055#action_12879055 ] Owen O'Malley commented on MAPREDUCE-1270: -- Posting entire tarballs isn't very useful. Can you include your changes as a patch?
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876988#action_12876988 ] zhang.pengfei commented on MAPREDUCE-1270: -- Woo, sounds so cool! So now you want to open-source it? Come on!
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841468#action_12841468 ] Owen O'Malley commented on MAPREDUCE-1270: -- By the way, here is an archive of the message that I sent back in Nov 07 comparing the performance of Java, pipes, and streaming. http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02961.html Especially by reimplementing the sort and shuffle, you should be able to get much faster than Java. *smile*
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841114#action_12841114 ] Owen O'Malley commented on MAPREDUCE-1270: -- {quote} I don't think we need to completely compatible with pipes API {quote} I don't think there is enough motivation to have two different C++ APIs, so you should use the same interface. That does *not* mean that you can't change the API to be better. You can and should help make the APIs more usable and extensible. {quote} If we do need a C++ API , we should consider usability and extensibility more then compatibility, because I don't realize such compatibility problem is a problem for most users . {quote} There is a requirement to provide backwards compatibility of all of Hadoop's public APIs with the previous version. APIs and interfaces can be deprecated and then removed in a later version, but compatibility is not optional.
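The deprecate-then-remove policy Owen describes has a direct C++ expression: an old entry point can keep working while the compiler steers callers to the replacement. A minimal sketch, assuming hypothetical names (`TaskContext`, `configValue`, `getJobConf` are illustrative, not the actual Pipes/HCE interfaces):

```cpp
// Sketch of backwards-compatible API evolution: the old accessor keeps
// working for one release cycle, flagged with [[deprecated]] (C++14) so
// callers get a compile-time nudge, and can be removed in a later version.
#include <cassert>
#include <string>

class TaskContext {
public:
    // New, preferred accessor.
    std::string configValue(const std::string& key) const {
        return key == "mapred.task.id" ? "attempt_0001_m_000000_0" : "";
    }

    // Old accessor kept for compatibility; same behavior, just renamed.
    [[deprecated("use configValue() instead")]]
    std::string getJobConf(const std::string& key) const {
        return configValue(key);
    }
};
```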
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841070#action_12841070 ] Wang Shouyan commented on MAPREDUCE-1270: - "In terms of apis, as I previously mentioned I strongly recommend you start using the Hadoop Pipes apis and enhance it - this will ensure compatibility between Hadoop Pipes and HCE - again, please consider moving the sort/shuffle/merge to Hadoop Pipes as I recommended previously." I do not agree with this opinion. If we need to establish a standard C++ API, I don't think we need to be completely compatible with the PIPES API, because I don't think the PIPES API was carefully considered; it may exist for compatibility with some other code, but it was never discussed adequately. If we do need a C++ API, we should consider usability and extensibility more than compatibility, because I don't see that such a compatibility problem is a problem for most users. For usability and extensibility, any suggestion is welcome.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840797#action_12840797 ] Hong Tang commented on MAPREDUCE-1270: -- bq. The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days. There are many hadoop devs fluent in Chinese, so it might still be a good idea to share the original design doc.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840795#action_12840795 ] Arun C Murthy commented on MAPREDUCE-1270: -- bq. The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days. Thanks! bq. For Q3, we indeed change the interface of Combiner, while the semantics for Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner. It's a reasonable argument, but I'd recommend we stay compatible with both Java Map-Reduce and Pipes by having the same interface. FYI: both Java and Pipes explicitly disallow changing of keys in the combiner in the 'contract'. If the user does go ahead and change the key, the application is not guaranteed to work. In terms of apis, as I previously mentioned, I strongly recommend you start using the Hadoop Pipes apis and enhance them - this will ensure compatibility between Hadoop Pipes and HCE - again, please consider moving the sort/shuffle/merge to Hadoop Pipes as I recommended previously.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840722#action_12840722 ] Luke Lu commented on MAPREDUCE-1270: Fusheng, feel free to attach the design doc if there is nothing confidential in it and Shouyan approves :). There are plenty of people on the thread who understand Chinese. It'd help me explain some details to Arun, now that I work next to him. On the combiner interface, I think it'd be better to add an emitValue convenience method instead of changing the interface, as there are quite a few legit uses.
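Luke's emitValue suggestion can be sketched as follows. This is a hypothetical illustration, not the actual Pipes or HCE interface: `CombineContext`, `setCurrentKey`, and the method names are assumptions. The idea is that the existing `emit(key, value)` stays, and the convenience method pins the output key to the key currently being combined.

```cpp
// Sketch of an emitValue convenience method: keep the existing
// emit(key, value) interface, and add emitValue(value) which reuses the
// current input key, so a combiner cannot accidentally change keys.
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class CombineContext {
public:
    // The framework sets this before invoking the combiner for a key group.
    void setCurrentKey(const std::string& k) { currentKey_ = k; }

    // Existing interface: the caller supplies the key (and could misuse it).
    void emit(const std::string& key, const std::string& value) {
        output_.emplace_back(key, value);
    }

    // Convenience: the output key is fixed to the one being combined.
    void emitValue(const std::string& value) { emit(currentKey_, value); }

    const std::vector<std::pair<std::string, std::string>>& output() const {
        return output_;
    }

private:
    std::string currentKey_;
    std::vector<std::pair<std::string, std::string>> output_;
};
```

A combiner written against `emitValue` alone preserves the sort-order contract by construction, while legacy code using `emit` keeps compiling.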
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840666#action_12840666 ] Fusheng Han commented on MAPREDUCE-1270: Arun, I appreciate your comments. The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days. For Q3, we indeed change the interface of Combiner, while the semantics for Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner. In the situation that two spills with sorted records will merge into file.out (the output of map phase). The data flow is in this way: -> two spills is read in a merged way -> Combiner receives sorted pairs -> after manipulation, Combiner emits output pairs -> the output is directly written in file.out If Combiner emits unrelated keys, the records in the file.out will not be fully sorted. In our interface, Combiner is not allowed to emit key and the output key is determined by the input. The sequence of records in file.out will be guaranteed. to be continued... :) > Hadoop C++ Extention > > > Key: MAPREDUCE-1270 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: task >Affects Versions: 0.20.1 > Environment: hadoop linux >Reporter: Wang Shouyan > > Hadoop C++ extension is an internal project in baidu, We start it for these > reasons: >1 To provide C++ API. We mostly use Streaming before, and we also try to > use PIPES, but we do not find PIPES is more efficient than Streaming. So we > think a new C++ extention is needed for us. >2 Even using PIPES or Streaming, it is hard to control memory of hadoop > map/reduce Child JVM. >3 It costs so much to read/write/sort TB/PB data by Java. When using > PIPES or Streaming, pipe or socket is not efficient to carry so huge data. 
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840340#action_12840340 ] Arun C Murthy commented on MAPREDUCE-1270: --

Fusheng, thinking about this a bit more, I have a suggestion to help push this through the Hadoop framework in a more straightforward manner and help it get committed: I'd propose you take existing Hadoop Pipes, keep _all_ of its APIs, and implement the map-side sort, shuffle, and reduce-side merge within Pipes itself, i.e. enhance Hadoop Pipes to own all of the 'data-path'. This way we can mark the 'C++ data-path' as experimental and let it co-exist with current functionality, which will make it far easier to get more experience with this.

Currently Pipes allows one to implement a C++ RecordReader for the map and a C++ RecordWriter for the reduce. We can enhance Pipes to collect the map output, sort it in C++, and write out the IFile and index for the map output. The reduces would do the shuffle, merge, and 'reduce' call in C++ and use the existing infrastructure for the C++ RecordWriter to write the outputs.

A note of caution: you will need to worry about TaskCompletionEvents, i.e. the events which let the reduces know the identity and location of completed maps. Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for this information, and this might be a sticky bit. As an intermediate step, one possible way around it is to change ReduceTask.java to relay the TaskCompletionEvents from the Java Child to the C++ reducer.

In terms of development, you could start developing on an svn branch of Hadoop Pipes. Thoughts?
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840277#action_12840277 ] Arun C Murthy commented on MAPREDUCE-1270: --

Fusheng, this is interesting. Could you please put up a design document? There are several pieces I'm interested in understanding better:
# Changes to the framework JobTracker/TaskTracker, e.g. changes to TaskRunner
# Implications for job submission, serialization of the job-conf, etc. from a C++ job-client
# I do not understand why you are changing the semantics of the Combiner; this is incompatible with Java Map-Reduce.
# I'd expect one to implement a C++ 'context object' for mappers, reducers, etc. I don't see this in your API at all?

I'm sure I'll have more comments once I see more details.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840166#action_12840166 ] Fusheng Han commented on MAPREDUCE-1270:

This project is under way inside Baidu, and the basic functions are complete. We have HCE (Hadoop C++ Extension) running with Text input and without any compression. About a 20 percent improvement over Streaming has been achieved; 40GB of input and 5 nodes were used in this experiment, and the MapReduce application was wordcount.

The interfaces exposed to users are similar to PIPES. The Mapper interface is:

class Mapper {
public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t map(MapInput &input) = 0;
protected:
    virtual void emit(const void* key, const int64_t keyLength,
                      const void* value, const int64_t valueLength) {
        getContext()->emit(key, keyLength, value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
};

Modeled after the new Hadoop MapReduce interface, setup() and cleanup() functions have been added. MapInput is a newly defined type for map input; the key and value can be retrieved from this object. An emit() function is provided, which can be invoked directly from map(). Keys and values are raw memory pointers with corresponding lengths, which is better suited to non-text data.

The Reducer has the same shape as the Mapper:

class Reducer {
public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t reduce(ReduceInput &input) = 0;
protected:
    virtual void emit(const void* key, const int64_t keyLength,
                      const void* value, const int64_t valueLength) {
        getContext()->emit(key, keyLength, value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
};

A slight difference is that ReduceInput can iterate over values with a next() function.
In Hadoop MapReduce, the Combiner interface is no different from the Reducer's. Here we make a small change: the Combiner can only emit a value (there is no key parameter in its emit function). The key is omitted from combine()'s emit pair because mistaken keys could corrupt the order of the map output; the output key of the emit() function is determined by the input.

class Combiner {
public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup(bool isSuccessful) { return 0; }
    virtual int64_t combine(ReduceInput &input) = 0;
protected:
    virtual void emit(const void* value, const int64_t valueLength) {
        getContext()->emit(getCombineKey(), getCombineKeyLength(),
                           value, valueLength);
    }
    virtual TaskContext* getContext() { return context; }
    virtual const void* getCombineKey() { return combineKey; }
    virtual int64_t getCombineKeyLength() { return combineKeyLength; }
};

The Partitioner also gets setup() and cleanup() functions:

class Partitioner {
public:
    virtual int64_t setup() { return 0; }
    virtual int64_t cleanup() { return 0; }
    virtual int partition(const void* key, const int64_t keyLength,
                          int numOfReduces) = 0;
};

Following PIPES, we add a new entry with the name "HCE" to the hadoop command. Users run a command like "hadoop hce XXX" to invoke an HCE MapReduce job. We'd like to hear your comments.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805457#action_12805457 ] He Yongqiang commented on MAPREDUCE-1270: -

Hi Dong / Shouyan, are you going to open-source this? If so, could you post an update on the recent work? That would help others understand it better.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794299#action_12794299 ] Zheng Shao commented on MAPREDUCE-1270: --- Any progress on this?
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786952#action_12786952 ] Dong Yang commented on MAPREDUCE-1270: --

1. The Child JVM process is retained; it sets up the runtime environment, starts the C++ process, and is responsible for communicating with Hadoop. It contains no data read/write logic.
2. The Child JVM process communicates with the C++ process via stdin, stderr, or stdout.
3. The C++ process only accepts commands, processes data, and reports its state; it is not concerned with scheduling or exception handling.
[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786814#action_12786814 ] Todd Lipcon commented on MAPREDUCE-1270: This is pretty interesting. How are you implementing TaskUmbilicalProtocol?