[
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041757#comment-14041757
]
Arun C Murthy commented on MAPREDUCE-2841:
------------------------------------------
On a related note regarding Pig/Hive etc. - I see Hadoop MapReduce fading away fast,
particularly as projects that use MR, such as Pig, Hive and Cascading,
re-vector onto other projects like Apache Tez or Apache Spark.
For example:
# Hive-on-Tez (https://issues.apache.org/jira/browse/HIVE-4660) - The Hive
community has already moved its major investments away from MR to Tez.
# Pig-on-Tez (https://issues.apache.org/jira/browse/PIG-3446) - The Pig
community is very close to shipping this in pig-0.14 and again is investing
heavily in Tez.
Given that, Sean/Todd, would it be useful to discuss contributing this to Tez
instead?
This way the work here would continue to stay relevant in the context of the
majority of MapReduce users, who use Pig, Hive, Cascading etc.
Of course, I'm sure another option is Apache Spark, but given that Tez is much
closer (code-base wise) to MR, it would be much easier to contribute to
Tez. Happy to help if that makes sense too. Thanks.
> Task level native optimization
> ------------------------------
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Environment: x86-64 Linux/Unix
> Reporter: Binglin Chang
> Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch,
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch,
> fb-shuffle.patch
>
>
> I've recently been working on native optimization for MapTask based on JNI.
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs
> emitted by the mapper, so that sort, spill and IFile serialization can all be
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not complete yet, so the test uses a large enough io.sort.mb
> to prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are
> supported, and I have not yet thought through many things, such as how to
> support map-side combine. I had some discussion with somebody familiar with
> Hive, and it seems that these limitations won't be much of a problem for Hive
> to benefit from those optimizations, at least. Advice or discussion about
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(),
> which checks whether the key/value types, comparator type and combiner are all
> compatible; if so, MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better
> final results, and I believe similar optimizations can be adopted for the
> reduce task and shuffle too.
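
To make the canEnable() check in the quoted description concrete, here is a
minimal Java sketch of such a compatibility gate. It is not the code from the
attached patches. The class name NativeCollectorCheck, the use of class-name
strings instead of a job configuration, and the rule that a null comparator
means "the key type's default binary comparator" are all assumptions made for
illustration. The only facts taken from the description are that Text and
BytesWritable are the supported types, that only binary comparison is
supported, and that map-side combine is not handled yet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the canEnable() check; not the patch's code. */
final class NativeCollectorCheck {
    // Key/value types the description lists as supported.
    private static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.BytesWritable"));

    // Assumption: comparators treated as plain byte-wise comparison.
    private static final Set<String> BINARY_COMPARATORS = new HashSet<>(Arrays.asList(
        "org.apache.hadoop.io.Text$Comparator",
        "org.apache.hadoop.io.BytesWritable$Comparator"));

    private NativeCollectorCheck() {}

    /**
     * Returns true only if every part of the job is compatible.
     * A null comparatorClass is assumed to mean "default comparator";
     * a null combinerClass means no combiner is configured.
     */
    static boolean canEnable(String keyClass, String valueClass,
                             String comparatorClass, String combinerClass) {
        if (!SUPPORTED_TYPES.contains(keyClass)) {
            return false; // unsupported key type
        }
        if (!SUPPORTED_TYPES.contains(valueClass)) {
            return false; // unsupported value type
        }
        if (comparatorClass != null && !BINARY_COMPARATORS.contains(comparatorClass)) {
            return false; // custom (non-binary) comparator
        }
        // Map-side combine is not supported yet, so any combiner disables it.
        return combinerClass == null;
    }
}
```

A caller standing in for MapTask would fall back to the regular Java collector
whenever canEnable(...) returns false, so an incompatible job keeps running
unchanged.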