[jira] Commented: (MAPREDUCE-326) The lowest level map-reduce APIs should be byte oriented

Jay Booth (JIRA) Tue, 16 Feb 2010 11:21:51 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834395#action_12834395
 ]


Jay Booth commented on MAPREDUCE-326:
-------------------------------------

Ok, maybe I got ahead of myself :)

Basically, I see this:
{quote}
public abstract void map(TaskSplitIndex splitIndex,
    RawMapOutputCollector collector, RawMapContext context)
    throws IOException, InterruptedException;
{quote}

as meaning "Your tasks just have to worry about an input split and then 
blasting their output to the framework as bytes". It wouldn't be too far a 
leap from there to write a runtime for mapred in other languages. Anyway, 
most salient to me personally is that there'd now be an API level where 
fetching/gathering your InputSplit could be handled at the 
job/framework/mapper level. If I want to do that now, I have to write a set 
of control files and throw data locality out the window. Decoupling the 
lower-level APIs from a specific serialization framework would be a win as 
well: AvroMapReduce, or whatever it's called, could be built right alongside 
the existing WritableMapReduce, which makes more sense than building one on 
top of the other.
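
To make the "input split in, bytes out" reading concrete, here is a minimal 
sketch of a mapper written against a byte-only collector. It is not the API 
from the attached patch: RawCollector, LineSplit, RawMapper and 
RawWordCountMapper are hypothetical stand-ins invented for illustration, and 
the split is just an in-memory list of lines rather than a real InputSplit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a raw collector: it accepts key/value bytes only.
interface RawCollector {
    void collect(byte[] key, byte[] value) throws IOException;
}

// Hypothetical stand-in for an input split: an in-memory list of lines.
final class LineSplit {
    final List<String> lines;
    LineSplit(List<String> lines) { this.lines = lines; }
}

// Byte-level mapper: it owns reading its split and emits nothing but bytes.
abstract class RawMapper {
    abstract void map(LineSplit split, RawCollector out) throws IOException;
}

// Word count written directly against the byte-level interface.
final class RawWordCountMapper extends RawMapper {
    @Override
    void map(LineSplit split, RawCollector out) throws IOException {
        byte[] one = {0, 0, 0, 1}; // the count 1 as a big-endian 32-bit int
        for (String line : split.lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(word.getBytes(StandardCharsets.UTF_8), one);
                }
            }
        }
    }
}

final class RawMapSketch {
    // Runs the mapper over the given lines and returns the emitted keys as text.
    static List<String> run(List<String> lines) {
        List<String> keys = new ArrayList<>();
        try {
            new RawWordCountMapper().map(new LineSplit(lines),
                (k, v) -> keys.add(new String(k, StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b", "b c")));
    }
}
```

Nothing in the mapper refers to a serialization framework, which is the 
property that would let a non-Java runtime implement the same contract.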

If I understand the current proposal correctly, we could have a join where one 
mapper class is pulling a big select statement from a DB, another is crunching 
some big compressed sequence files, and another is pulling in a bunch of tiny 
Hive partitions using CombineFileInput, without them stepping all over each 
other and creating "last one wins" configuration conditions. This is 
theoretically doable under the current framework, but it involves a lot of 
shoehorning, so that's the itch it would scratch for me. Making the framework 
serialization-agnostic would, I think, be an even bigger win: Writables are 
clean and light, but they're not the be-all and end-all of serialization.

I guess I just see the proposed binary-level framework as a "gateway condition" 
to a whole bunch of wins. As long as everything is tied directly to Hadoop 
Writables in Java, we can only go so far beyond the basic wordcount program. 
If we have a common, robust, low-level binary API that's exposed for all to 
use, we could rapidly see framework implementations in a few languages, more 
flexible input methods, different serialization formats, non-mapreduce 
distributed computing ("just distribute these runnables across the cluster and 
tell me when they're done"), etc. The immediate goal of having Avro talk to 
bytes directly, instead of Avro talking to Writables which then talk to bytes, 
seems a decent enough short-term win to justify the work, IMO, especially when 
you consider the long-term flexibility.
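
The layering argument can be sketched as a thin typed adapter sitting directly 
on a byte-level sink, with the serializer plugged in per job. Everything here 
is an assumption made for illustration: ByteSink, Codec, Utf8Codec, IntCodec 
and TypedCollector are hypothetical names, not classes from Hadoop, Avro, or 
the attached patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical byte-level sink, standing in for the framework's raw output path.
interface ByteSink {
    void collect(byte[] key, byte[] value);
}

// Hypothetical pluggable serializer. A Writable-based layer and an Avro-based
// layer would each supply their own implementation of this interface.
interface Codec<T> {
    byte[] toBytes(T value);
    T fromBytes(byte[] bytes);
}

final class Utf8Codec implements Codec<String> {
    public byte[] toBytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    public String fromBytes(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
}

final class IntCodec implements Codec<Integer> {
    public byte[] toBytes(Integer i) { return ByteBuffer.allocate(4).putInt(i).array(); }
    public Integer fromBytes(byte[] b) { return ByteBuffer.wrap(b).getInt(); }
}

// Thin typed layer: serializes once, then hands the bytes straight to the sink.
final class TypedCollector<K, V> {
    private final ByteSink sink;
    private final Codec<K> keyCodec;
    private final Codec<V> valueCodec;

    TypedCollector(ByteSink sink, Codec<K> keyCodec, Codec<V> valueCodec) {
        this.sink = sink;
        this.keyCodec = keyCodec;
        this.valueCodec = valueCodec;
    }

    void collect(K key, V value) {
        sink.collect(keyCodec.toBytes(key), valueCodec.toBytes(value));
    }
}

final class TypedLayerSketch {
    // Emits one typed pair through the byte sink, then decodes it again.
    static String roundTrip(String key, int value) {
        List<byte[][]> raw = new ArrayList<>();
        Codec<String> keys = new Utf8Codec();
        Codec<Integer> values = new IntCodec();
        TypedCollector<String, Integer> out = new TypedCollector<String, Integer>(
            (k, v) -> raw.add(new byte[][] {k, v}), keys, values);
        out.collect(key, value);
        byte[][] pair = raw.get(0);
        return keys.fromBytes(pair[0]) + "=" + values.fromBytes(pair[1]);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hadoop", 42));
    }
}
```

Swapping the serialization format means swapping the Codec implementations; 
the sink and the adapter stay the same, and no intermediate object type sits 
between the typed layer and the bytes.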

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>         Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
>
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to 
> use arbitrary types complicate the design and lead to lots of object creates 
> and other overhead that a byte oriented design would not suffer.  I believe 
> the lowest level implementation of hadoop map-reduce should have byte string 
> oriented APIs (for keys and values).  This API would be more performant, 
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
