[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2012-07-03 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405688#comment-13405688
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

Hi Mikhail, Yihang,

I am sorry that I can't post the most recent / stable version of HCE for 
download; some limitations prevent me from doing so.

We have now redirected HCE to MAPREDUCE-2841 (task-level native optimization), 
which is a new implementation based on HCE and provides a larger performance 
improvement.

We will keep contributing to MAPREDUCE-2841; please watch that JIRA.

Thanks,
Dong

 Hadoop C++ Extension
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment: hadoop linux
Reporter: Wang Shouyan
 Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE 
 Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ 
 Extension.doc


   Hadoop C++ Extension is an internal project at Baidu. We started it for 
 these reasons:
 1. To provide a C++ API. We mostly used Streaming before, and we also tried 
 PIPES, but we did not find PIPES to be more efficient than Streaming, so we 
 concluded that a new C++ extension was needed.
 2. Even using PIPES or Streaming, it is hard to control the memory of the 
 Hadoop map/reduce Child JVM.
 3. It is very costly to read/write/sort TB/PB-scale data in Java, and when 
 using PIPES or Streaming, a pipe or socket is not efficient for carrying such 
 huge volumes of data.
 What we want to do:
 1. We do not use the map/reduce Child JVM for any data processing; it just 
 prepares the environment, starts the C++ mapper, tells the mapper which split 
 it should deal with, and reads reports from the mapper until it finishes. The 
 mapper reads records, invokes the user-defined map, performs partitioning, 
 writes spills, combines, and merges into file.out. We think these operations 
 can be done by C++ code.
 2. The reducer is similar to the mapper; it is started after the sort 
 finishes, reads from the sorted files, invokes the user-defined reduce, and 
 writes to the user-defined record writer.
 3. We also intend to rewrite shuffle and sort in C++, for efficiency and 
 memory control.
 At first, 1 and 2; then 3.
 What is the difference from PIPES?
 1. Yes, we will reuse most of the PIPES code.
 2. But we will do it more completely: nothing changes in scheduling and 
 management, but everything changes in execution.
 *UPDATE:*
 Now you can get a test version of HCE from this link:
 http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
 This is a full package with all the Hadoop source code.
 Following the document HCE InstallMenu.pdf in the attachments, you can build 
 and deploy it on your cluster.
 The attachment HCE Tutorial.pdf will lead you through writing your first HCE 
 program and gives other specifications of the interface.
 The attachment HCE Performance Report.pdf gives a performance report of HCE 
 compared to Java MapReduce and Pipes.
 Any comments are welcome.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2012-07-01 Thread Mikhail Bautin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404883#comment-13404883
 ] 

Mikhail Bautin commented on MAPREDUCE-1270:
---

Hello HCE Developers,

Would it be possible to post the most recent / stable version of HCE for 
download? It would be even better if you could continuously push your HCE code 
changes to, e.g., a GitHub repository.

Thanks,
Mikhail






[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-08-13 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084768#comment-13084768
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

With MAPREDUCE-279, we can now support alternate runtimes for MapReduce - do 
you guys want to take a look and see if we can integrate more closely? The Java 
layer might be completely unnecessary now...





[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-08-13 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777
 ] 

Binglin Chang commented on MAPREDUCE-1270:
--

Hi Arun,
HCE 2.0 is mainly focused on stability (bug fixes) and usability.
Bug fixes: HCE is not very stable right now; although we have fixed a lot of 
bugs, the current codebase is a mess :( A lot of work needs to be done, but 
there is currently no time (other projects).
Usability: (bi)streaming over HCE is now released, as is PyHCE, since 
(bi)streaming and Python are much more popular than the Java API at Baidu; 
also C++ versions of partitioners such as KeyFieldBasedPartitioner, 
Input/OutputFormats such as SequenceFile, CombineInput.., multiple output, and 
compression codecs such as lzma, lzo, and quicklz.
As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not 
added yet); we gain another 10-20%, both in Hadoop and in upper-level 
applications.

About MR-v2:
We keep watching your progress and have already read your design doc and some 
code; we look forward to further discussion on this very interesting topic.









[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040800#comment-13040800
 ] 

Binglin Chang commented on MAPREDUCE-1270:
--

Koth, in HCE a socket is only used for passing control messages (unlike C++ 
pipes), which has little impact on performance. As for data processing 
(input/map/mid-output/reduce/output), everything is implemented in C++, so JNI 
is not needed, except for reading input from and writing output to HDFS, for 
which HCE uses libhdfs, which is JNI based.
I think a JNI-based C++ extension for MR has the advantage of being 
non-intrusive, and has better compatibility. In the current HCE design, we 
need to reimplement many features that already exist in Java; some of them get 
a performance benefit (sort, spill), while others are purely duplicated work.
In the current HCE design, if you want the performance benefits of HCE, the 
only way is to use the HCE interface. My thought is to extract the 
high-performance parts (sort, spill, compression in MapOutputCollector), wrap 
them using JNI as a native library like the compression codecs, and add a 
jobconf item to enable/disable the native optimization, so the code stays 
compatible and Java-based jobs can also get the performance benefits.




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-04 Thread koth chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028748#comment-13028748
 ] 

koth chen commented on MAPREDUCE-1270:
--

I don't think a pipes-based map/reduce task will perform better than a 
JNI-based one! Why do you guys think socket communication will be better than 
a JNI method call?
I've written a JNI-based framework for C++ map/reduce tasks, and ported 
HBase's HFile to my framework as the input/output format. It works great!




[jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention

2011-05-04 Thread eric baldeschwieler (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028750#comment-13028750
 ] 

eric baldeschwieler commented on MAPREDUCE-1270:


Hi Folks,

I'm back part-time, but I'm mainly focused on catching up and adjusting to life 
with a newborn at home.

Peter Cnudde is currently heading up Hadoop service delivery.

Most line issues can continue to go to Amol, Satish, Avik or Senthil as 
appropriate.

I am around; drop me a line at my personal email or call my cell if you need a 
rapid response, but I am reading mail now.

Thanks,
E14




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891544#action_12891544
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

Here is HADOOP-HCE-1.0.0.patch for the mapreduce trunk (revision 963075), 
which includes the Hadoop C++ Extension (HCE for short) changes against 
mapreduce-963075.

The steps for using this patch are as follows:
1. Download HADOOP-HCE-1.0.0.patch
2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
3. cd trunk-963075
4. patch -p0 < HADOOP-HCE-1.0.0.patch
5. sh build.sh (needs java, forrest and ant)

HCE includes Java and C++ code and depends on libhdfs, so build.sh first 
checks out the hdfs trunk and builds it.





[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891596#action_12891596
 ] 

Allen Wittenauer commented on MAPREDUCE-1270:
-

This patch appears to contain code from the C++ Boost library. Someone needs to 
do the legwork to determine the legality of the patch.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-07-23 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891662#action_12891662
 ] 

Doug Cutting commented on MAPREDUCE-1270:
-

Looks like BSD:

http://www.boost.org/LICENSE_1_0.txt

So we'd just need to append it to LICENSE.txt, noting there which files are 
under this license.

 Hadoop C++ Extention
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment: hadoop linux
Reporter: Wang Shouyan
 Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc


 Hadoop C++ Extension (HCE) is an internal project at Baidu. We started it for these reasons:
 1. To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed.
 2. Even using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce child JVM.
 3. It costs a great deal to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge volumes of data.
 What we want to do:
 1. The map/reduce child JVM does no data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split it should deal with, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, partitions, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
 2. The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to a user-defined record writer.
 3. We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
 At first, 1 and 2, then 3.
 What's the difference from PIPES:
 1. Yes, we will reuse most of the PIPES code.
 2. And we will do it more completely: nothing changes in scheduling and management, but everything changes in execution.
 *UPDATE:*
 Now you can get a test version of HCE from this link:
 http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
 This is a full package with all Hadoop source code.
 Following the document HCE InstallMenu.pdf in the attachments, you can build and deploy it in your cluster.
 The attachment HCE Tutorial.pdf will lead you through writing your first HCE program and gives other specifications of the interface.
 The attachment HCE Performance Report.pdf gives a performance report of HCE compared to Java MapReduce and Pipes.
 Any comments are welcome.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-17 Thread Wang Shouyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879707#action_12879707
 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-

Posting the entire tarball is just for trial; we will deploy it in our 
production environment first, and later provide a patch for trunk.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-15 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879055#action_12879055
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

Posting entire tarballs isn't very useful. Can you include your changes as a 
patch?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-06-09 Thread zhang.pengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876988#action_12876988
 ] 

zhang.pengfei commented on MAPREDUCE-1270:
--

Wow, sounds so cool!

So now you want to open-source it? Come on!




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-04 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841468#action_12841468
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

By the way, here is an archive of the message that I sent back in Nov 07 
comparing the performance of Java, pipes, and streaming.

http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02961.html

Especially by reimplementing the sort and shuffle, you should be able to get 
much faster than Java. *smile*




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Fusheng Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840666#action_12840666
 ] 

Fusheng Han commented on MAPREDUCE-1270:


Arun, I appreciate your comments.

The bad news is that our design document is written in Chinese. My team members 
and I will post some design details step by step over the next few days.

For Q3, we did indeed change the interface of the Combiner, but its semantics 
are the same as in Java Map-Reduce; the change prevents mistaken use of the 
Combiner. Consider the situation where two spills of sorted records are merged 
into file.out (the output of the map phase). The data flow is:
- the two spills are read in merged order
- the Combiner receives sorted key/value pairs
- after processing, the Combiner emits output key/value pairs
- the output is written directly to file.out
If the Combiner emitted unrelated keys, the records in file.out would no longer 
be fully sorted. In our interface, the Combiner is not allowed to emit a key; 
the output key is determined by the input. The order of records in file.out is 
therefore guaranteed.

to be continued... :)




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840722#action_12840722
 ] 

Luke Lu commented on MAPREDUCE-1270:


Fusheng, feel free to attach the design doc if there is nothing confidential in 
it and Shouyan approves :). There are plenty of people on this thread who 
understand Chinese. It would help me explain some details to Arun, now that I 
work next to him.

On the combiner interface, I think it would be better to add an emitValue 
convenience method instead of changing the interface, as there are quite a few 
legitimate uses.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840795#action_12840795
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

bq. The bad news is that our design document is written in Chinese. My team 
members and I will put some design details step by step in the next few days.

Thanks!

bq. For Q3, we indeed change the interface of Combiner, while the semantics for 
Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner.

It's a reasonable argument, but I'd recommend we stay compatible with both Java 
Map-Reduce and Pipes by having the same interface. FYI: both Java and Pipes 
explicitly disallow changing the key in the combiner as part of the 'contract'. 
If the user does go ahead and change the key, the application is not guaranteed 
to work.



In terms of APIs, as I previously mentioned, I strongly recommend you start from 
the Hadoop Pipes APIs and enhance them - this will ensure compatibility between 
Hadoop Pipes and HCE - and again, please consider moving the sort/shuffle/merge 
to Hadoop Pipes as I recommended previously.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840797#action_12840797
 ] 

Hong Tang commented on MAPREDUCE-1270:
--

bq. The bad news is that our design document is written in Chinese. My team 
members and I will put some design details step by step in the next few days.

There are many hadoop devs fluent in Chinese, so it might still be a good idea 
to share the original design doc.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Wang Shouyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841070#action_12841070
 ] 

Wang Shouyan commented on MAPREDUCE-1270:
-

bq. In terms of APIs, as I previously mentioned I strongly recommend you start 
using the Hadoop Pipes APIs and enhance them - this will ensure compatibility 
between Hadoop Pipes and HCE - again, please consider moving the 
sort/shuffle/merge to Hadoop Pipes as I recommended previously.

I do not agree with this opinion. If we need to establish a standard for the 
C++ API, I don't think we need to be completely compatible with the Pipes API, 
because I don't think the Pipes API was carefully considered; it may have been 
shaped by compatibility with some other code, but it was never discussed 
adequately.

If we do need a C++ API, we should consider usability and extensibility more 
than compatibility, because I don't see that compatibility is a problem for 
most users.

For usability and extensibility, any suggestion is welcome.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-03 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841114#action_12841114
 ] 

Owen O'Malley commented on MAPREDUCE-1270:
--

{quote}
I don't think we need to completely compatible with pipes API
{quote}
I don't think there is enough motivation to have two different C++ APIs, so you 
should use the same interface. That does *not* mean that you can't change the 
API to be better. You can and should help make the APIs more usable and 
extensible.

{quote}
If we do need a C++ API , we should consider usability and extensibility more 
then compatibility, because I don't realize such compatibility problem is a 
problem for most users .
{quote}
There is a requirement to provide backwards compatibility of all of Hadoop's 
public APIs with the previous version. APIs and interfaces can be deprecated 
and then removed in a later version, but compatibility is not optional.






[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Fusheng Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840166#action_12840166
 ] 

Fusheng Han commented on MAPREDUCE-1270:


This project is under way inside Baidu, and the basic functions are complete. 
We have HCE (Hadoop C++ Extension) running smoothly with Text input and without 
any compression. About a 20 percent improvement over Streaming has been 
achieved, using 40 GB of input on 5 nodes, with word count as the MapReduce 
application.

The interfaces exposed to users are similar to PIPES. The Mapper interface is:
class Mapper {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t map(MapInput input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // set by the framework
};
Modeled after the new Hadoop MapReduce interface, setup() and cleanup() 
functions are added here. MapInput is a newly defined type for map input; the 
key and value can be retrieved from this object. An emit() function is 
provided, which can be invoked directly from the map() function. Keys and 
values are raw memory pointers, each accompanied by a length, which is better 
for non-text data.

The Reducer has the same shape as the Mapper:
class Reducer {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t reduce(ReduceInput input) = 0;

protected:
  virtual void emit(const void* key, const int64_t keyLength,
                    const void* value, const int64_t valueLength) {
    getContext()->emit(key, keyLength, value, valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  TaskContext* context;  // set by the framework
};
A slight difference is that ReduceInput can retrieve successive values with its 
next() function.

In Hadoop MapReduce, the Combiner interface is no different from Reduce. Here 
we make a small change: the Combiner can only emit a value (there is no key 
parameter in its emit function). The key is omitted from the combine function's 
emit pair because mistaken keys could corrupt the order of the map output; the 
output key of the emit() function is determined by the input.
class Combiner {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup(bool isSuccessful) { return 0; }
  virtual int64_t combine(ReduceInput input) = 0;

protected:
  virtual void emit(const void* value, const int64_t valueLength) {
    getContext()->emit(getCombineKey(), getCombineKeyLength(), value,
                       valueLength);
  }
  virtual TaskContext* getContext() {
    return context;
  }
  virtual const void* getCombineKey() {
    return combineKey;
  }
  virtual int64_t getCombineKeyLength() {
    return combineKeyLength;
  }
  TaskContext* context;      // set by the framework
  const void* combineKey;    // current input key, maintained by the framework
  int64_t combineKeyLength;
};

The Partitioner also gets setup() and cleanup() functions:
class Partitioner {
public:
  virtual int64_t setup() { return 0; }
  virtual int64_t cleanup() { return 0; }
  virtual int partition(const void* key, const int64_t keyLength,
                        int numOfReduces) = 0;
};

Following Pipes, we add a new entry named hce to the hadoop command. 
Users run a command like hadoop hce XXX to invoke HCE MapReduce.

We'd like to hear your comments.


 Hadoop C++ Extention
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment:  hadoop linux
Reporter: Wang Shouyan

    Hadoop C++ Extension is an internal project at Baidu. We started it for 
 these reasons:
 1  To provide a C++ API. We mostly used Streaming before, and we also tried 
 PIPES, but we did not find PIPES more efficient than Streaming, so we think 
 a new C++ extension is needed for us.
 2  Even using PIPES or Streaming, it is hard to control the memory of the 
 Hadoop map/reduce Child JVM.
 3  It costs so much to read/write/sort TB/PB of data in Java. When using 
 PIPES or Streaming, a pipe or socket is not efficient enough to carry such 
 huge data.
 What we want to do: 
 1 We do not use the map/reduce Child JVM for any data processing; it just 
 prepares the environment, starts the C++ mapper, tells the mapper which split 
 it should deal with, and reads reports from the mapper until it finishes. The 
 mapper will read records, invoke the user-defined map, do the partitioning, 
 write spills, combine, and merge into file.out. We think these operations can 
 be done in C++ code.
 2 The reducer is similar to the mapper; it is started after the sort 
 finishes, reads from sorted files, invokes the user-defined reduce, and 
 writes to a user-defined record writer.
 3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
 memory control.
 At first, 1 and 2, then 3.
 What's the difference from PIPES:
 1 Yes, we will reuse most of the PIPES code.
 2 And we should do it more completely: nothing changed in scheduling and 
 management, but everything in execution.

[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840277#action_12840277
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

Fusheng, this is interesting.

Could you please put up a design document? There are several pieces I'm 
interested in understanding better:
# Changes to the framework (JobTracker/TaskTracker), e.g. changes to TaskRunner
# Implications for job submission, serialization of the job-conf, etc. from a 
C++ job-client
# I do not understand why you are changing the Combiner semantics; this is 
incompatible with Java Map-Reduce.
# I'd expect one to implement a C++ 'context object' for mappers, reducers, 
etc. I don't see this in your API at all?

I'm sure I'll have more comments once I see more details.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-03-02 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840340#action_12840340
 ] 

Arun C Murthy commented on MAPREDUCE-1270:
--

Fusheng, thinking about this a bit more, I have a suggestion to help push this 
through the Hadoop framework in a more straightforward manner and help it get 
committed:

I'd propose you guys take existing hadoop pipes, keep _all_ of its apis and 
implement the map-side sort, shuffle and reduce-side merge within pipes itself 
i.e. enhance hadoop pipes to have all of the 'data-path'. This way we can mark 
the 'C++ data-path' as experimental and co-exist with current functionality, 
thus it will be far easier to get more experience with this.

Currently pipes allows one to implement a C++ RecordReader for the map and a 
C++ RecordWriter for the reduce. We can enhance pipes to collect the 
map-output, sort it in C++ and write out the IFile and index for the 
map-output. The reduces would do the shuffle, merge & 'reduce' call in C++ and 
use the existing infrastructure for the C++ recordwriter to write the outputs.

A note of caution: you will need to worry about TaskCompletionEvents, i.e. the 
events that let the reduces know the identity and location of completed maps. 
Currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for 
this information, and this might be a sticky bit. As an intermediate step, one 
possible workaround is to change ReduceTask.java to relay the 
TaskCompletionEvents from the Java Child to the C++ reducer.

In terms of development, you could start developing on a svn branch of hadoop 
pipes.

Thoughts?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-01-27 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805457#action_12805457
 ] 

He Yongqiang commented on MAPREDUCE-1270:
-

Hi Dong / Shouyan,
Are you going to open-source this? If so, can you post an update on the recent 
work? That would help others understand it better.




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-23 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794299#action_12794299
 ] 

Zheng Shao commented on MAPREDUCE-1270:
---

Any progress on this?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786814#action_12786814
 ] 

Todd Lipcon commented on MAPREDUCE-1270:


This is pretty interesting. How are you implementing TaskUmbilicalProtocol?




[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-07 Thread Dong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786953#action_12786953
 ] 

Dong Yang commented on MAPREDUCE-1270:
--

1. The Child JVM process is retained; it sets up the runtime environment, 
starts the C++ process, and is in charge of communicating with Hadoop, but it 
contains no data read/write logic.
2. The Child JVM process communicates with the C++ process via stdin, stdout, 
or stderr.
3. The C++ process only accepts commands, processes data, and reports status; 
it is not concerned with scheduling or exception handling.
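The stdin/stdout channel in point 2 implies a simple line-based report protocol on the JVM side. The exact wire format is not given in this thread, so the parser below assumes a hypothetical "progress:<fraction>" line purely for illustration.

```cpp
#include <cstdlib>
#include <string>

// Parse one hypothetical status line from the C++ child, e.g. "progress:0.5".
// Returns true and stores the fraction when the line is a valid progress
// report in [0, 1]; returns false for any other line. The "progress:" prefix
// is an assumption for this sketch, not HCE's actual protocol.
bool parse_progress(const std::string& line, double* progress) {
    const std::string prefix = "progress:";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;
    *progress = std::strtod(line.c_str() + prefix.size(), nullptr);
    return *progress >= 0.0 && *progress <= 1.0;
}
```

A line-oriented text protocol like this keeps the JVM side trivial: it only forwards commands down and relays status reports up to the TaskTracker.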




