[ https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829838#action_12829838 ]

Doug Cutting commented on MAPREDUCE-326:
----------------------------------------

> 1. keys compress better separately than intermixed with values (not that we 
> do that currently...)

Sure, if you know something about the structure then you can usually compress 
better, but key/value is only one kind of structure.  An Avro-specific 
compressor could compress records as columns.  If we want to enable 
structure-specific compression in the shuffle, then we should perhaps make that 
pluggable.
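
For concreteness, such a hook might look something like the following.  This 
is a rough sketch only; the interface and its name are hypothetical, not an 
existing Hadoop API:

{code}
// Hypothetical sketch only; no such hook exists in Hadoop today.
// A default implementation would wrap an ordinary byte-stream codec,
// while an Avro-specific implementation could pivot the serialized
// records into columns before compressing them.
public interface ShuffleCompressor {

  /** Compress a run of serialized records in [offset, offset+length). */
  byte[] compress(byte[] serializedRecords, int offset, int length);

  /** Invert compress(), returning the original serialized records. */
  byte[] decompress(byte[] compressed, int offset, int length);
}
{code}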

> 2. combining the keys and values will effectively block using memcmp for the 
> sort

Memcmp-friendly keys and values are a data format.  If a job uses memcmp for 
comparison, then it could include a key-length of each datum, perhaps as a 
varint tacked on the front.  If a job uses Avro for comparison, then it needs 
to include a schema, but distinguishing keys and values is not required.  So, 
depending on how comparison is done, the requirements are different.  As I said 
above, comparators might be a pluggable part of the kernel, like device drivers 
in the Linux kernel.
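
For example, a memcmp-style raw comparator over that layout might look like 
this (the class itself is a hypothetical sketch, though the 
WritableComparator/WritableUtils helpers are real Hadoop utilities):

{code}
import java.io.IOException;

import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Sketch: assumes each datum is laid out as
//   <varint key-length><key bytes><value bytes>
// and that only the key bytes participate in the sort.
public class MemcmpKeyComparator implements RawComparator<byte[]> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      // Read the varint key-length tacked on the front of each datum.
      int keyLen1 = WritableComparator.readVInt(b1, s1);
      int keyLen2 = WritableComparator.readVInt(b2, s2);
      int vintLen1 = WritableUtils.decodeVIntSize(b1[s1]);
      int vintLen2 = WritableUtils.decodeVIntSize(b2[s2]);
      // Lexicographic (memcmp-style) comparison of the key bytes only.
      return WritableComparator.compareBytes(b1, s1 + vintLen1, keyLen1,
                                             b2, s2 + vintLen2, keyLen2);
    } catch (IOException e) {
      throw new RuntimeException("malformed varint key-length", e);
    }
  }

  public int compare(byte[] a, byte[] b) {
    return compare(a, 0, a.length, b, 0, b.length);
  }
}
{code}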

> 3. We could support more flexible memory management if the values can be 
> distinguished from the keys.

I don't see how this would work in practice.

> 4. Since Sequence Files and T Files contain both Keys and Values, they would 
> need to be wrapped together to present them as a single object.

That would be done as a part of the Java compatibility API.  Their 
serializations could just be appended, I think.
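
E.g., a sketch of that appending, using Hadoop's DataOutputBuffer (the class 
and method here are hypothetical):

{code}
import java.io.IOException;

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Writable;

// Sketch of the compatibility shim (class and method are hypothetical).
public class KeyValueShim {

  /** Present a key/value pair as one datum by appending serializations. */
  public static byte[] appendSerializations(Writable key, Writable value)
      throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    key.write(out);    // key bytes first...
    value.write(out);  // ...value bytes appended directly after
    byte[] datum = new byte[out.getLength()];
    System.arraycopy(out.getData(), 0, datum, 0, out.getLength());
    return datum;
  }
}
{code}

Splitting such a datum back apart requires knowing where the key ends, which 
is where something like the varint key-length above would come in.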

> I don't see what this abstraction is buying you over using ByteBuffer and a 
> Serializer that knows how to use it.

You're suggesting that, instead of adding a new low-level API, we should add a 
new high-level mapreduce API based on BytesWritable?  We'd, e.g., serialize 
Avro data into a BytesWritable key and use NullWritable for the value.  The 
comparator could get the schema from the job.  Then we'd have a new 
Avro-specific high-level framework for mapreduce, with its own Mapper and 
Reducer interfaces.  The Avro mapreduce framework would serialize map outputs 
to a BytesWritable, then pass them on through the existing mapreduce API.  This 
adds another layer of buffering, but, other than that, it could work.  We could 
even make it generic to arbitrary serialization engines, not just Avro.  Is 
this what you'd prefer that mapreduce applications using Avro do?
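
If so, a rough sketch of the pieces might be the following (the config key 
"avro.map.output.schema" and the class name are made up; BinaryData.compare 
and the 4-byte BytesWritable length header are real):

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.RawComparator;

// Sketch of the layering described above: Avro map outputs are
// serialized into a BytesWritable key, and the comparator recovers
// the schema from the job configuration to order the raw bytes.
public class AvroBytesWritableShim implements RawComparator<BytesWritable> {

  private final Schema schema;

  public AvroBytesWritableShim(Configuration conf) {
    this.schema =
        new Schema.Parser().parse(conf.get("avro.map.output.schema"));
  }

  /** Serialize an Avro datum into a BytesWritable map-output key. */
  public static BytesWritable toKey(Object datum, Schema schema)
      throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(baos, null);
    new GenericDatumWriter<Object>(schema).write(datum, enc);
    enc.flush();
    return new BytesWritable(baos.toByteArray());
  }

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Skip BytesWritable's 4-byte length header to reach the Avro bytes.
    return BinaryData.compare(b1, s1 + 4, b2, s2 + 4, schema);
  }

  public int compare(BytesWritable a, BytesWritable b) {
    return BinaryData.compare(a.getBytes(), 0, b.getBytes(), 0, schema);
  }
}
{code}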

My primary goal here is making steps towards language independence.  It is 
unfortunate that the lowest-level MapReduce APIs are built around high-level, 
language-specific concepts like Java classes and generics.  Rather, we might 
work to hone the mapreduce kernel to a minimal, language-independent core, then 
add a variety of higher-level APIs in different languages.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to 
> use arbitrary types complicate the design and lead to lots of object creates 
> and other overhead that a byte oriented design would not suffer.  I believe 
> the lowest level implementation of hadoop map-reduce should have byte string 
> oriented APIs (for keys and values).  This API would be more performant, 
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.
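
For illustration, such a byte-string-oriented low-level API might look roughly 
like this (a hypothetical sketch, not an existing interface):

{code}
// Purely illustrative; not an existing Hadoop interface.  Keys and
// values are opaque byte ranges: serialization, typing, and
// comparison all live in layers above (or in pluggable modules).
public interface RawMapper {

  /** Collector for raw, already-serialized key/value output. */
  interface RawCollector {
    void collect(byte[] key, int keyOffset, int keyLength,
                 byte[] value, int valueOffset, int valueLength);
  }

  /** Map one raw record; all byte ranges are opaque to the framework. */
  void map(byte[] key, int keyOffset, int keyLength,
           byte[] value, int valueOffset, int valueLength,
           RawCollector output);
}
{code}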

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
