[ 
https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833277#action_12833277
 ] 

Chris Douglas commented on MAPREDUCE-326:
-----------------------------------------

If the goal is a faster binary API, then this should use NIO primitives, not 
byte[] and certainly not DataInputBuffer/DataOutputBuffer. Section 3.3, which 
addresses backwards compatibility, actually *introduces a buffer copy*, unless 
the serializers are backed by the output collector (which would make the call 
to {{collect}} redundant). Where the current proposal doesn't add buffer 
copies, it doesn't remove them from the framework either, so how exactly is it 
more efficient?
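
To illustrate the NIO point, here is a minimal sketch of how a ByteBuffer can 
expose a record as a view over a shared backing buffer without copying it, 
which a bare byte[] parameter cannot express without also passing an offset and 
a length. The class and method names (NioViewDemo, view) are hypothetical and 
are not part of the proposal or of Hadoop; only java.nio.ByteBuffer is real.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, for illustration only; not a Hadoop or proposal API. */
final class NioViewDemo {

    /** Returns a view of length bytes starting at offset, sharing storage with backing. */
    static ByteBuffer view(ByteBuffer backing, int offset, int length) {
        ByteBuffer dup = backing.duplicate(); // own position/limit, shared content
        dup.limit(offset + length);           // set the limit first so the position stays legal
        dup.position(offset);
        return dup.slice();                   // index 0 of the slice maps to offset in backing
    }

    public static void main(String[] args) {
        ByteBuffer backing = ByteBuffer.allocate(8);
        ByteBuffer record = view(backing, 2, 3);
        backing.put(2, (byte) 42);            // write through the backing buffer...
        System.out.println(record.get(0));    // ...and the view sees it: prints 42
    }
}
```

Because the view and the backing buffer share storage, handing the view to a 
consumer costs no copy; whether the framework can hold on to such a view safely 
is a separate lifetime question that this sketch does not address.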

It would help discussion if Pig, Hive, or Avro had a concrete use case 
demonstrating a clear performance win that _cannot_ be implemented in the 
current API. The design document cites "high level record readers that produce 
objects[...] a needless conversion" but this is simply false. One can read 
directly from a stream into a byte-oriented record and "serialize" that record 
into the MapOutputBuffer by a single buffer copy. If the user wants to 
manipulate bytes directly from a binary reader, the framework does not prevent 
this. The claim that the existing API forces an inefficient data path is 
unsubstantiated.
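
A rough sketch of what such a byte-oriented record could look like follows. It 
mirrors the write/readFields shape of a Writable but depends only on java.io so 
that it runs standalone; RawRecord and RawRecordDemo are hypothetical names, 
the length-prefixed framing is an assumption made for the example, and a 
ByteArrayOutputStream stands in for the map output buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical byte-oriented record; same shape as a Writable, no Hadoop dependency. */
final class RawRecord {
    private byte[] buf = new byte[0];
    private int len = 0;

    /** Fills the record straight from the stream: 4-byte length, then payload (assumed framing). */
    void readFields(DataInput in) throws IOException {
        len = in.readInt();
        if (buf.length < len) {
            buf = new byte[len];        // grow only when needed; the array is reused across records
        }
        in.readFully(buf, 0, len);
    }

    /** "Serializes" the record by copying its bytes once into the destination. */
    void write(DataOutput out) throws IOException {
        out.writeInt(len);
        out.write(buf, 0, len);
    }
}

final class RawRecordDemo {

    /** Reads one framed record from input and writes it to a stand-in output buffer. */
    static byte[] passThrough(byte[] input) {
        try {
            RawRecord rec = new RawRecord();
            rec.readFields(new DataInputStream(new ByteArrayInputStream(input)));
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the output buffer
            rec.write(new DataOutputStream(sink));
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] framed = {0, 0, 0, 3, 10, 20, 30};   // length 3, then three payload bytes
        System.out.println(java.util.Arrays.equals(framed, passThrough(framed))); // prints true
    }
}
```

No intermediate object graph is built here: the bytes go from the input stream 
into the record's array, and from there into the output in one copy.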

Making the components that implement the sort and merge available to frameworks and 
advanced users is a far simpler goal and offers every gain the current proposal 
claims. Cleaning up those utility APIs and making them visible does *not* 
require redesigning the core of MapReduce around a new (third!) set of 
user-facing abstractions. I am -1 on the API proposal, but heartily endorse 
efforts to make the byte-oriented MapReduce utilities (MapOutputBuffer, IFile, 
Merger, etc.) available to frameworks. Marking them as limited private and 
managing the interfaces between them can be done iteratively and transparently 
to users. The maintenance cost in evolving these utilities is *far* lower than 
a new user API.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>         Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
>
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to 
> use arbitrary types complicate the design and lead to lots of object creates 
> and other overhead that a byte oriented design would not suffer.  I believe 
> the lowest level implementation of hadoop map-reduce should have byte string 
> oriented APIs (for keys and values).  This API would be more performant, 
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

