[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2012-07-03 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405688#comment-13405688
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

Hi Mikhail, Yihang,

I am sorry that I can't post the most recent / stable version of HCE for 
download; some limitations prevent me from doing so.

We have now redirected HCE to MAPREDUCE-2841 (task-level native optimization), 
which is a new implementation based on HCE and provides a larger performance 
improvement.

We will keep contributing to MAPREDUCE-2841; please watch that JIRA.

Thanks,
Dong

 Hadoop C++ Extension
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment: hadoop linux
Reporter: Wang Shouyan
 Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE 
 Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ 
 Extension.doc


   Hadoop C++ Extension is an internal project at Baidu. We started it for 
 these reasons:
 1. To provide a C++ API. We mostly used Streaming before, and we also tried 
 PIPES, but we did not find PIPES to be more efficient than Streaming, so we 
 concluded that a new C++ extension was needed.
 2. Even using PIPES or Streaming, it is hard to control the memory of the 
 Hadoop map/reduce Child JVM.
 3. It is very costly to read/write/sort TB/PB-scale data in Java, and when 
 using PIPES or Streaming, a pipe or socket is not efficient for carrying such 
 huge volumes of data.
 What we want to do:
 1. We do not use the map/reduce Child JVM for any data processing; it just 
 prepares the environment, starts the C++ mapper, tells the mapper which split 
 it should deal with, and reads reports from the mapper until it finishes. The 
 mapper reads records, invokes the user-defined map, performs partitioning, 
 writes spills, combines, and merges into file.out. We think these operations 
 can be done by C++ code.
 2. The reducer is similar to the mapper; it is started after the sort 
 finishes, reads from the sorted files, invokes the user-defined reduce, and 
 writes to the user-defined record writer.
 3. We also intend to rewrite shuffle and sort in C++, for efficiency and 
 memory control.
 At first, 1 and 2; then 3.
 What is the difference from PIPES?
 1. Yes, we will reuse most of the PIPES code.
 2. But we will do it more completely: nothing changes in scheduling and 
 management, but everything changes in execution.
 *UPDATE:*
 Now you can get a test version of HCE from this link:
 http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
 This is a full package with all the Hadoop source code.
 Following the document HCE InstallMenu.pdf in the attachments, you can build 
 and deploy it on your cluster.
 The attachment HCE Tutorial.pdf will lead you through writing your first HCE 
 program and gives other specifications of the interface.
 The attachment HCE Performance Report.pdf gives a performance report of HCE 
 compared to Java MapReduce and Pipes.
 Any comments are welcome.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2012-07-01 Thread Mikhail Bautin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404883#comment-13404883
 ] 

Mikhail Bautin commented on MAPREDUCE-1270:
---

Hello HCE Developers,

Would it be possible to post the most recent / stable version of HCE for 
download? It would be even better if you could continuously push your HCE code 
changes to, e.g., a GitHub repository.

Thanks,
Mikhail






[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-08-13 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084768#comment-13084768
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

With MAPREDUCE-279, we can now support alternate runtimes for MapReduce - do 
you guys want to take a look and see if we can integrate more closely? The Java 
layer might be completely unnecessary now...





[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-08-13 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777
 ] 

Binglin Chang commented on MAPREDUCE-1270:
--

Hi Arun,
HCE 2.0 is mainly focused on stability (bug fixes) and usability.
Bug fixes: HCE is not very stable right now; although we have fixed a lot of 
bugs, the current codebase is a mess :( A lot of work needs to be done, but 
there is currently no time (other projects).
Usability: (bi)streaming over HCE is now released, as is PyHCE, since 
(bi)streaming and Python are much more popular than the Java API at Baidu; 
also C++ versions of partitioners such as KeyFieldBasedPartitioner, 
Input/OutputFormats such as SequenceFile, CombineInput.., multiple output, and 
compression codecs such as lzma, lzo, and quicklz.
As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not 
added yet); we gain another 10-20%, both in Hadoop and in upper-level 
applications.

About MR-v2:
We keep watching your progress and have already read your design doc and some 
code; we look forward to further discussion on this very interesting topic.









[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040800#comment-13040800
 ] 

Binglin Chang commented on MAPREDUCE-1270:
--

Koth, in HCE a socket is only used for passing control messages (unlike C++ 
pipes), which has little impact on performance. As for data processing 
(input/map/mid-output/reduce/output), everything is implemented in C++, so JNI 
is not needed, except for reading input from and writing output to HDFS, for 
which HCE uses libhdfs, which is JNI based.
I think a JNI-based C++ extension for MR has the advantage of being 
non-intrusive, and has better compatibility. In the current HCE design, we 
need to reimplement many features that already exist in Java; some of them get 
a performance benefit (sort, spill), while others are purely duplicated work.
In the current HCE design, if you want the performance benefits of HCE, the 
only way is to use the HCE interface. My thought is to extract the 
high-performance parts (sort, spill, compression in MapOutputCollector), wrap 
them using JNI as a native library like the compression codecs, and add a 
jobconf item to enable/disable the native optimization, so the code stays 
compatible and Java-based jobs can also get the performance benefits.




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-04 Thread koth chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028748#comment-13028748
 ] 

koth chen commented on MAPREDUCE-1270:
--

I don't think a pipes-based map/reduce task will perform better than a 
JNI-based one! Why do you guys think socket communication will be better than 
a JNI method call?
I've written a JNI-based framework for C++ map/reduce tasks, and ported 
HBase's HFile to my framework as the input/output format. It works great!




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-04 Thread eric baldeschwieler (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028750#comment-13028750
 ] 

eric baldeschwieler commented on MAPREDUCE-1270:


Hi Folks,

I'm back part-time, but I'm mainly focused on catching up and adjusting to life 
with a newborn at home.

Peter Cnudde is currently heading up Hadoop service delivery.

Most line issues can continue to go to Amol, Satish, Avik or Senthil as 
appropriate.

I am around; drop me a line at my personal email or call my cell if you need a 
rapid response, but I am reading mail now.

Thanks,
E14




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891544#action_12891544
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

Here is HADOOP-HCE-1.0.0.patch for the mapreduce trunk (revision 963075), 
which includes the Hadoop C++ Extension (HCE for short) changes against 
mapreduce-963075.

The steps for using this patch are as follows:
1. Download HADOOP-HCE-1.0.0.patch
2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
3. cd trunk-963075
4. patch -p0 < HADOOP-HCE-1.0.0.patch
5. sh build.sh (needs java, forrest and ant)

HCE includes Java and C++ code and depends on libhdfs, so build.sh first 
checks out the hdfs trunk and builds it.





[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891596#action_12891596
 ] 

Allen Wittenauer commented on MAPREDUCE-1270:
-

This patch appears to contain code from the C++ Boost library. Someone needs to 
do the legwork to determine the legality of the patch.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891662#action_12891662
 ] 

Doug Cutting commented on MAPREDUCE-1270:
-

Looks like BSD:

http://www.boost.org/LICENSE_1_0.txt

So we'd just need to append it to LICENSE.txt, noting there which files are 
under this license.

 Hadoop C++ Extention
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment: hadoop linux
Reporter: Wang Shouyan
 Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc


 Hadoop C++ Extension (HCE) is an internal project at Baidu. We started it for these reasons:
 1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
 2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
 3. It costs a great deal to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge volumes of data.
 What we want to do:
 1. The map/reduce child JVM does no data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should deal with, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, partitions, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
 2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to a user-defined record writer.
 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
 At first, 1 and 2, then 3.
 What's the difference from PIPES:
 1. Yes, we will reuse most of the PIPES code.
 2. And we will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
 *UPDATE:*
 Now you can get a test version of HCE from this link:
 http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
 This is a full package with all Hadoop source code.
 Following the document HCE InstallMenu.pdf in the attachments, you can build and deploy it in your cluster.
 The attachment HCE Tutorial.pdf will lead you through writing your first HCE program and gives other specifications of the interface.
 The attachment HCE Performance Report.pdf gives a performance report of HCE compared to Java MapReduce and Pipes.
 Any comments are welcome.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-17 Thread Wang Shouyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879707#action_12879707
 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-

Posting the entire tarball is just for trial; we will deploy it in our 
production environment first, and later provide a patch for trunk.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-15 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879055#action_12879055
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

Posting entire tarballs isn't very useful. Can you include your changes as a 
patch?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-09 Thread zhang.pengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876988#action_12876988
 ] 

zhang.pengfei commented on MAPREDUCE-1270:
--

Wow, sounds so cool!

So now you want to open-source it? Come on!




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-04 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841468#action_12841468
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

By the way, here is an archive of the message that I sent back in Nov 07 
comparing the performance of Java, pipes, and streaming.

http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02961.html

Especially by reimplementing the sort and shuffle, you should be able to get 
much faster than Java. *smile*




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Fusheng Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840666#action_12840666
 ] 

Fusheng Han commented on MAPREDUCE-1270:


Arun, I appreciate your comments.

The bad news is that our design document is written in Chinese. My team members 
and I will post some design details step by step over the next few days.

For Q3, we did indeed change the interface of the Combiner, but its semantics 
are the same as in Java Map-Reduce; the change prevents mistaken use of the 
Combiner. Consider the situation where two spills of sorted records are merged 
into file.out (the output of the map phase). The data flow is:
- the two spills are read in merged order
- the Combiner receives sorted key/value pairs
- after processing, the Combiner emits output key/value pairs
- the output is written directly to file.out
If the Combiner emitted unrelated keys, the records in file.out would no longer 
be fully sorted. In our interface, the Combiner is not allowed to emit a key; 
the output key is determined by the input. The order of records in file.out is 
therefore guaranteed.

to be continued... :)




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840722#action_12840722
 ] 

Luke Lu commented on MAPREDUCE-1270:


Fusheng, feel free to attach the design doc if there is nothing confidential in 
it and Shouyan approves :). There are plenty of people on this thread who 
understand Chinese. It would help me explain some details to Arun, now that I 
work next to him.

On the combiner interface, I think it would be better to add an emitValue 
convenience method instead of changing the interface, as there are quite a few 
legitimate uses.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840795#action_12840795
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

bq. The bad news is that our design document is written in Chinese. My team 
members and I will put some design details step by step in the next few days.

Thanks!

bq. For Q3, we indeed change the interface of Combiner, while the semantics for 
Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner.

It's a reasonable argument, but I'd recommend we stay compatible with both Java 
Map-Reduce and Pipes by having the same interface. FYI: both Java and Pipes 
explicitly disallow changing the key in the combiner as part of the 'contract'. 
If the user does go ahead and change the key, the application is not guaranteed 
to work.



In terms of APIs, as I previously mentioned, I strongly recommend you start from 
the Hadoop Pipes APIs and enhance them - this will ensure compatibility between 
Hadoop Pipes and HCE - and again, please consider moving the sort/shuffle/merge 
to Hadoop Pipes as I recommended previously.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840797#action_12840797
 ] 

Hong Tang commented on MAPREDUCE-1270:
--

bq. The bad news is that our design document is written in Chinese. My team 
members and I will put some design details step by step in the next few days.

There are many hadoop devs fluent in Chinese, so it might still be a good idea 
to share the original design doc.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Wang Shouyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841070#action_12841070
 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-

bq. In terms of APIs, as I previously mentioned I strongly recommend you start 
using the Hadoop Pipes APIs and enhance them - this will ensure compatibility 
between Hadoop Pipes and HCE - again, please consider moving the 
sort/shuffle/merge to Hadoop Pipes as I recommended previously.

I do not agree with this opinion. If we need to establish a standard for the 
C++ API, I don't think we need to be completely compatible with the Pipes API, 
because I don't think the Pipes API was carefully considered; it may have been 
shaped by compatibility with some other code, but it was never discussed 
adequately.

If we do need a C++ API, we should consider usability and extensibility more 
than compatibility, because I don't see that compatibility is a problem for 
most users.

For usability and extensibility, any suggestion is welcome.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841114#action_12841114
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

{quote}
I don't think we need to completely compatible with pipes API
{quote}
I don't think there is enough motivation to have two different C++ APIs, so you 
should use the same interface. That does *not* mean that you can't change the 
API to be better. You can and should help make the APIs more usable and 
extensible.

{quote}
If we do need a C++ API , we should consider usability and extensibility more 
then compatibility, because I don't realize such compatibility problem is a 
problem for most users .
{quote}
There is a requirement to provide backwards compatibility of all of Hadoop's 
public APIs with the previous version. APIs and interfaces can be deprecated 
and then removed in a later version, but compatibility is not optional.






[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Fusheng Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840166#action_12840166
 ] 

Fusheng Han commented on MAPREDUCE-1270:


This project is under way inside Baidu, and the basic functions are complete. 
We have HCE (Hadoop C++ Extension) running smoothly with Text input and without 
any compression. About a 20 percent improvement over Streaming has been 
achieved, using 40 GB of input on 5 nodes, with word count as the MapReduce 
application.

The interfaces exposed to users are similar to PIPES. The Mapper interface is:
class Mapper {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t map(MapInput input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // set by the framework
};
Modeled after the new Hadoop MapReduce interface, setup() and cleanup() 
functions are added here. MapInput is a newly defined type for map input; the 
key and value can be retrieved from this object. An emit() function is 
provided, which can be invoked directly from the map() function. Keys and 
values are raw memory pointers, each accompanied by a length, which is better 
for non-text data.

The Reducer has the same shape as the Mapper:
class Reducer {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t reduce(ReduceInput input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // set by the framework
};
A slight difference is that ReduceInput can retrieve successive values with its 
next() function.

In Hadoop MapReduce, the Combiner interface is no different from Reduce. Here 
we make a small change: the Combiner can only emit a value (there is no key 
parameter in its emit function). The key is omitted from the combine function's 
emit pair because mistaken keys could corrupt the order of the map output; the 
output key of the emit() function is determined by the input.
class Combiner {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t combine(ReduceInput input) = 0;

protected:
  virtual void emit(const void* value, const int64_t valueLength) {
    getContext()->emit(getCombineKey(), getCombineKeyLength(), value,
                       valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  virtual const void* getCombineKey() {
    return combineKey;
  }
  virtual int64_t getCombineKeyLength() {
    return combineKeyLength;
  }
  TaskContext* context;      // set by the framework
  const void* combineKey;    // current input key, maintained by the framework
  int64_t combineKeyLength;
};

The Partitioner also gets setup() and cleanup() functions:
class Partitioner {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup() { return 0; }
  virtual int partition(const void* key, const int64_t keyLength,
                        int numOfReduces) = 0;
};

Following Pipes, we add a new entry named hce to the hadoop command. 
Users run a command like hadoop hce XXX to invoke HCE MapReduce.

We'd like to hear your comments.


 Hadoop C++ Extention
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment:  hadoop linux
Reporter: Wang Shouyan

    Hadoop C++ Extension is an internal project at Baidu. We started it for 
 these reasons:
 1  To provide a C++ API. We mostly used Streaming before, and we also tried 
 PIPES, but we did not find PIPES more efficient than Streaming, so we think 
 a new C++ extension is needed for us.
 2  Even using PIPES or Streaming, it is hard to control the memory of the 
 Hadoop map/reduce Child JVM.
 3  It costs so much to read/write/sort TB/PB of data in Java. When using 
 PIPES or Streaming, a pipe or socket is not efficient enough to carry such 
 huge data.
 What we want to do: 
 1 We do not use the map/reduce Child JVM for any data processing; it just 
 prepares the environment, starts the C++ mapper, tells the mapper which split 
 it should deal with, and reads reports from the mapper until it finishes. The 
 mapper will read records, invoke the user-defined map, do the partitioning, 
 write spills, combine, and merge into file.out. We think these operations can 
 be done in C++ code.
 2 The reducer is similar to the mapper; it is started after the sort 
 finishes, reads from sorted files, invokes the user-defined reduce, and 
 writes to a user-defined record writer.
 3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
 memory control.
 At first, 1 and 2, then 3.
 What's the difference from PIPES:
 1 Yes, we will reuse most of the PIPES code.
 2 And we should do it more completely: nothing changed in scheduling and 
 management, but everything in execution.

[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840277#action_12840277
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

Fusheng, this is interesting.

Could you please put up a design document? There are several pieces I'm 
interested in understanding better:
# Changes to the framework (JobTracker/TaskTracker), e.g. changes to TaskRunner
# Implications for job submission, serialization of the job-conf, etc. from a 
C++ job-client
# I do not understand why you are changing the Combiner semantics; this is 
incompatible with Java Map-Reduce.
# I'd expect one to implement a C++ 'context object' for mappers, reducers, 
etc. I don't see this in your API at all?

I'm sure I'll have more comments once I see more details.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840340#action_12840340
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

Fusheng, thinking about this a bit more, I have a suggestion to help push this 
through the Hadoop framework in a more straightforward manner and help it get 
committed:

I'd propose you guys take existing hadoop pipes, keep _all_ of its apis and 
implement the map-side sort, shuffle and reduce-side merge within pipes itself 
i.e. enhance hadoop pipes to have all of the 'data-path'. This way we can mark 
the 'C++ data-path' as experimental and co-exist with current functionality, 
thus it will be far easier to get more experience with this.

Currently pipes allows one to implement a C++ RecordReader for the map and a 
C++ RecordWriter for the reduce. We can enhance pipes to collect the 
map-output, sort it in C++ and write out the IFile and index for the 
map-output. The reduces would do the shuffle, merge & 'reduce' call in C++ and 
use the existing infrastructure for the C++ recordwriter to write the outputs.

A note of caution: you will need to worry about TaskCompletionEvents, i.e. the 
events that let the reduces know the identity and location of completed maps. 
Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for 
this information, and this might be a sticky bit. As an intermediate step, one 
possible workaround is to change ReduceTask.java to relay the 
TaskCompletionEvents from the Java Child to the C++ reducer.

In terms of development, you could start developing on a svn branch of hadoop 
pipes.

Thoughts?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-01-27 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805457#action_12805457
 ] 

He Yongqiang commented on MAPREDUCE-1270:
-

Hi Dong / Shouyan,
Are you going to open-source this? If so, can you post an update on the recent 
work? That would help others understand it better.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-23 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794299#action_12794299
 ] 

Zheng Shao commented on MAPREDUCE-1270:
---

Any progress on this?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786814#action_12786814
 ] 

Todd Lipcon commented on MAPREDUCE-1270:


This is pretty interesting. How are you implementing TaskUmbilicalProtocol?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-07 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786953#action_12786953
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

1. The Child JVM process is retained; it sets up the runtime environment, 
starts the C++ process, and is in charge of communicating with Hadoop, but it 
contains no data read/write logic.
2. The Child JVM process communicates with the C++ process via stdin, stdout, 
or stderr.
3. The C++ process only accepts commands, processes data, and reports status; 
it is not concerned with scheduling or exception handling.
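The stdin/stdout channel in point 2 implies a simple line-based report protocol on the JVM side. The exact wire format is not given in this thread, so the parser below assumes a hypothetical "progress:<fraction>" line purely for illustration.

```cpp
#include <cstdlib>
#include <string>

// Parse one hypothetical status line from the C++ child, e.g. "progress:0.5".
// Returns true and stores the fraction when the line is a valid progress
// report in [0, 1]; returns false for any other line. The "progress:" prefix
// is an assumption for this sketch, not HCE's actual protocol.
bool parse_progress(const std::string& line, double* progress) {
    const std::string prefix = "progress:";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;
    *progress = std::strtod(line.c_str() + prefix.size(), nullptr);
    return *progress >= 0.0 && *progress <= 1.0;
}
```

A line-oriented text protocol like this keeps the JVM side trivial: it only forwards commands down and relays status reports up to the TaskTracker.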




