[jira] [Updated] (MAPREDUCE-1270) Hadoop C++ Extention

2016-04-13 Thread luoxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luoxu  updated MAPREDUCE-1270:
--
Affects Version/s: 2.6.2

> Hadoop C++ Extention
> 
>
> Key: MAPREDUCE-1270
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
>Affects Versions: 0.20.1
> Environment:  hadoop linux
>Reporter: Wang Shouyan
> Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE 
> Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ 
> Extension.doc
>
>
>   Hadoop C++ Extension is an internal project at Baidu. We started it for 
> these reasons:
>1  To provide a C++ API. We mostly used Streaming before, and we also tried 
> PIPES, but we did not find PIPES any more efficient than Streaming. So we 
> think a new C++ extension is needed.
>2  Even with PIPES or Streaming, it is hard to control the memory of the 
> Hadoop map/reduce child JVM.
>3  Reading, writing, and sorting TB/PB-scale data in Java is expensive, and 
> with PIPES or Streaming the pipe or socket is not efficient enough to carry 
> that much data.
>What we want to do: 
>1 The map/reduce child JVM does no data processing. It only prepares the 
> environment, starts the C++ mapper, tells the mapper which split to process, 
> and reads reports from the mapper until it finishes. The mapper reads each 
> record, invokes the user-defined map, partitions, writes spills, combines, 
> and merges into file.out. We think all of these operations can be done in 
> C++ code.
>2 The reducer is similar to the mapper. It is started after the sort 
> finishes, reads from the sorted files, invokes the user-defined reduce, and 
> writes to the user-defined record writer.
>3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
> memory control.
>First 1 and 2, then 3.
>What's the difference from PIPES:
>1 We will reuse most of the PIPES code.
>2 But we will do it more completely: nothing changes in scheduling and 
> management, but everything changes in execution.
> *UPDATE:*
> Now you can get a test version of HCE from this link: 
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all the Hadoop source code.
> Following the attached document "HCE InstallMenu.pdf", you can build and 
> deploy it in your cluster.
> The attachment "HCE Tutorial.pdf" walks you through writing your first HCE 
> program and gives further specifications of the interface.
> The attachment "HCE Performance Report.pdf" compares the performance of HCE 
> with Java MapReduce and Pipes.
> Any comments are welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-1270) Hadoop C++ Extention

2016-04-13 Thread luoxu (JIRA)


luoxu  updated MAPREDUCE-1270:
--
Affects Version/s: (was: 2.6.2)






[jira] [Updated] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-04 Thread Owen O'Malley (JIRA)


Owen O'Malley updated MAPREDUCE-1270:
-

Comment: was deleted

(was: Hi Folks,

I'm back part-time, but I'm mainly focused on catching up and adjusting to life 
with a newborn at home.

Peter Cnudde is currently heading up Hadoop service delivery.

Most line issues can continue to go to Amol, Satish, Avik or Senthil as 
appropriate.

I am around; drop me a line at my personal email or call my cell if you need a 
rapid response, but I am reading mail now.

Thanks,
E14
)


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Dong Yang (JIRA)


Dong Yang updated MAPREDUCE-1270:
-

Attachment: HADOOP-HCE-1.0.0.patch

HCE-1.0.0.patch for MapReduce trunk (revision 963075)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-12 Thread Fusheng Han (JIRA)


Fusheng Han updated MAPREDUCE-1270:
---

Attachment: HCE Performance Report.pdf
HCE Tutorial.pdf
HCE InstallMenu.pdf





[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-12 Thread Fusheng Han (JIRA)


Fusheng Han updated MAPREDUCE-1270:
---

Description: 

[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-12 Thread Fusheng Han (JIRA)


Fusheng Han updated MAPREDUCE-1270:
---

Description: 

[jira] Updated: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-14 Thread Dong Yang (JIRA)


Dong Yang updated MAPREDUCE-1270:
-

Attachment: Overall Design of Hadoop C++ Extension.doc

Hadoop C++ Extension (HCE for short) is a framework for making MapReduce more 
stable and faster.
Here is the overall design of HCE; we welcome your viewpoints on its practical 
implementation.

