[ 
https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833907#action_12833907
 ] 

Doug Cutting commented on MAPREDUCE-326:
----------------------------------------

Chris> If the goal is a faster binary API, then this should use NIO primitives 
[ ... ]

Perhaps this might instead look something like:

{code}
import java.nio.channels.WritableByteChannel;

// Split stands for the input split type; it is not defined in this sketch
interface RawMapper {
  void map(Split split, RawMapOutput output);
}
interface RawMapOutput {
  // records are written as contiguous byte ranges here
  WritableByteChannel getChannel();
  // call with the position of each record in the data written
  void addRecord(long start, long length);
  // utility to help keep track of bytes written
  long getBytesWritten();
}
{code}

The goal is to let the kernel identify record boundaries (so that it can 
compare, sort and transmit records) while minimizing per-record data copying.  
Getting this API right without benchmarking might prove difficult.  We should 
benchmark it under various scenarios: a key/value pair of Writable instances, 
line-based data from a text file, and length-delimited raw binary data.
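
To make the line-based scenario concrete, here is a minimal sketch of how the proposed interfaces might be exercised. It is not part of any patch on this issue. BufferedRawMapOutput, LineMapper and sortedRecords() are hypothetical names, a byte[] stands in for the undefined Split type, and the throws clause on map() is an added assumption. The in-memory buffer copies data, so this illustrates only the calling pattern (one bulk write, then one addRecord() call per line), not the copy-avoiding implementation a real framework would need.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Interfaces as proposed in the comment; byte[] replaces Split (assumption).
interface RawMapper {
  void map(byte[] split, RawMapOutput output) throws IOException;
}

interface RawMapOutput {
  WritableByteChannel getChannel();
  void addRecord(long start, long length);
  long getBytesWritten();
}

// Hypothetical in-memory output; a real framework would spill and transmit.
class BufferedRawMapOutput implements RawMapOutput {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final WritableByteChannel channel = Channels.newChannel(buffer);
  private final List<long[]> records = new ArrayList<>(); // {start, length}

  public WritableByteChannel getChannel() { return channel; }

  public void addRecord(long start, long length) {
    if (start < 0 || length < 0 || start + length > buffer.size()) {
      throw new IllegalArgumentException("record lies outside written data");
    }
    records.add(new long[] {start, length});
  }

  public long getBytesWritten() { return buffer.size(); }

  // Unsigned byte-wise comparison of two record ranges within the same data.
  private static int compare(byte[] d, long[] a, long[] b) {
    int n = (int) Math.min(a[1], b[1]);
    for (int i = 0; i < n; i++) {
      int c = (d[(int) a[0] + i] & 0xff) - (d[(int) b[0] + i] & 0xff);
      if (c != 0) return c;
    }
    return Long.compare(a[1], b[1]);
  }

  // Sorts the record index only, then materializes each record for inspection.
  List<byte[]> sortedRecords() {
    byte[] data = buffer.toByteArray();
    List<long[]> sorted = new ArrayList<>(records);
    sorted.sort((a, b) -> compare(data, a, b));
    List<byte[]> result = new ArrayList<>();
    for (long[] r : sorted) {
      result.add(Arrays.copyOfRange(data, (int) r[0], (int) (r[0] + r[1])));
    }
    return result;
  }
}

// Line-based mapper: writes the split once, then registers each line.
class LineMapper implements RawMapper {
  public void map(byte[] split, RawMapOutput out) throws IOException {
    long base = out.getBytesWritten();
    ByteBuffer src = ByteBuffer.wrap(split);
    while (src.hasRemaining()) {
      out.getChannel().write(src);
    }
    int start = 0;
    for (int i = 0; i < split.length; i++) {
      if (split[i] == '\n') {
        out.addRecord(base + start, i - start); // newline is not in the record
        start = i + 1;
      }
    }
    if (start < split.length) {
      out.addRecord(base + start, split.length - start); // unterminated last line
    }
  }
}
```

In this sketch the mapper never hands individual records to the framework. It reports only offsets and lengths, and the sort operates on that index. Whether that actually saves time over per-record writes is the kind of question the benchmarks above would have to answer.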

Chris> Better pipes/streaming workflows are explicitly considered in 
MAPREDUCE-1183; one can imagine an implementation of the MapTask or ReduceTask 
loading its user code in an implementation written in the native language.

Can you please elaborate?  I don't see the words "pipes" or "streaming" 
mentioned in that issue.  How does one load Python, Ruby, C++, etc. into Java?  
MAPREDUCE-1183 seems to me just to be a different way to encapsulate 
configuration data, grouping it per extension point rather than centralizing it 
in the job config.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>         Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
>
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to 
> use arbitrary types complicate the design and lead to lots of object creates 
> and other overhead that a byte oriented design would not suffer.  I believe 
> the lowest level implementation of hadoop map-reduce should have byte string 
> oriented APIs (for keys and values).  This API would be more performant, 
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
