[
https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834395#action_12834395
]
Jay Booth commented on MAPREDUCE-326:
-------------------------------------
Ok, maybe I got ahead of myself :)
Basically, I see this:
{quote}
public abstract void map(TaskSplitIndex splitIndex,
RawMapOutputCollector collector, RawMapContext context)
throws IOException, InterruptedException;
{quote}
as meaning "Your tasks just have to worry about an input split and then
blasting their output to the framework as bytes". It wouldn't be too far a
leap from there to write a runtime for mapred in other languages. Anyway,
most salient to me personally was the fact that there'd now be an API level
where fetching/gathering your InputSplit could be handled at the
job/framework/mapper level. If I want to do that now, I have to write a set
of control files and throw data locality out the window. Decoupling the
lower-level APIs from a specific serialization framework would seem to be a
win as well: AvroMapReduce, or whatever it's called, could be built right
alongside the existing WritableMapReduce, which would seem to make more
sense than building one on top of the other.
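To make that reading concrete, here is a minimal sketch of a task written purely against bytes. RawCollector, RawMapTask and RawWordCount are hypothetical stand-ins made up for illustration; they are not the TaskSplitIndex, RawMapOutputCollector or RawMapContext from the attached patch, whose real signatures may well differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a byte-level output collector (not the patch's type).
interface RawCollector {
    void collect(byte[] key, byte[] value) throws IOException;
}

// Hypothetical task contract: raw split bytes in, raw key/value bytes out.
abstract class RawMapTask {
    abstract void map(byte[] split, RawCollector collector) throws IOException;
}

// Word count against bytes only: no Writables, no serializer lookup.
class RawWordCount extends RawMapTask {
    private static final byte[] ONE = {0, 0, 0, 1}; // big-endian int 1

    @Override
    void map(byte[] split, RawCollector collector) throws IOException {
        String text = new String(split, StandardCharsets.UTF_8);
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                collector.collect(word.getBytes(StandardCharsets.UTF_8), ONE);
            }
        }
    }
}

class RawMapSketch {
    // Runs the task over an in-memory "split" and returns the emitted keys.
    static List<String> run(String input) {
        List<String> keys = new ArrayList<>();
        RawCollector collector =
            (k, v) -> keys.add(new String(k, StandardCharsets.UTF_8));
        try {
            new RawWordCount().map(input.getBytes(StandardCharsets.UTF_8), collector);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(run("to be or not to be"));
    }
}
```

Nothing in the task depends on a key or value class, which is the property that would let a non-Java runtime or a different serialization format plug in at the same level.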
If I understand the current proposal correctly, we could have a join where one
mapper class is pulling a big select statement from a DB, another is crunching
some big compressed sequence files, and another is pulling in a bunch of tiny
Hive partitions using CombineFileInput, without them stepping all over each
other and creating "last one wins" configuration conditions. This is
theoretically doable under the current framework but it involves a lot of
shoehorning, so that's the itch it would be scratching for me. Making the
framework serialization-agnostic would, I think, be an even bigger win:
Writables are clean and light, but they're not the be-all and end-all of
serialization.
I guess I just see the proposed binary-level framework as a "gateway condition"
to a whole bunch of wins. As long as everything is directly tied to Hadoop
Writables in Java, there's only going to be so far we can go beyond the basic
wordcount program. If we have a common, robust, low-level binary API that's
exposed for all to use, we could rapidly see framework implementations in a few
languages, more flexible input methods, different serialization formats,
non-mapreduce distributed computing ("just distribute these runnables across
the cluster and tell me when they're done"), etc. The immediate goal of having
Avro talk to bytes directly, instead of Avro talking to Writables that in turn
talk to bytes, seems to be a decent enough short-term win to justify the work,
IMO, especially when you consider the long-term flexibility.
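As a rough illustration of a typed layer sitting directly on bytes, the sketch below wraps a byte-level sink with pluggable codecs. ByteSink, Codec, Utf8Codec, IntCodec and TypedSink are all hypothetical names invented here; an Avro-backed or Writable-backed codec would be one more implementation of the same interface, and neither would need to go through the other.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical byte-level sink, standing in for the framework's raw output.
interface ByteSink {
    void emit(byte[] key, byte[] value);
}

// Hypothetical codec: the only place a serialization format appears.
interface Codec<T> {
    byte[] encode(T value);
    T decode(byte[] bytes);
}

final class Utf8Codec implements Codec<String> {
    public byte[] encode(String v) { return v.getBytes(StandardCharsets.UTF_8); }
    public String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
}

final class IntCodec implements Codec<Integer> {
    public byte[] encode(Integer v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    public Integer decode(byte[] b) { return ByteBuffer.wrap(b).getInt(); }
}

// Thin typed layer: encodes keys and values, then hands bytes to the raw sink.
final class TypedSink<K, V> {
    private final ByteSink raw;
    private final Codec<K> keyCodec;
    private final Codec<V> valueCodec;

    TypedSink(ByteSink raw, Codec<K> keyCodec, Codec<V> valueCodec) {
        this.raw = raw;
        this.keyCodec = keyCodec;
        this.valueCodec = valueCodec;
    }

    void emit(K key, V value) {
        raw.emit(keyCodec.encode(key), valueCodec.encode(value));
    }
}

class TypedLayerSketch {
    // Emits one typed pair through the byte layer and decodes it back.
    static String roundTrip(String key, int value) {
        List<byte[][]> out = new ArrayList<>();
        ByteSink raw = (k, v) -> out.add(new byte[][] {k, v});
        Utf8Codec keys = new Utf8Codec();
        IntCodec values = new IntCodec();
        new TypedSink<>(raw, keys, values).emit(key, value);
        byte[][] pair = out.get(0);
        return keys.decode(pair[0]) + "=" + values.decode(pair[1]);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hadoop", 42));
    }
}
```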
> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
> Key: MAPREDUCE-326
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: eric baldeschwieler
> Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
>
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to
> use arbitrary types complicate the design and lead to lots of object creates
> and other overhead that a byte oriented design would not suffer. I believe
> the lowest level implementation of hadoop map-reduce should have byte string
> oriented APIs (for keys and values). This API would be more performant,
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.