[
https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829838#action_12829838
]
Doug Cutting commented on MAPREDUCE-326:
----------------------------------------
> 1. keys compress better separately than intermixed with values (not that we
> do that currently...)
Sure, if you know something about the structure then you can usually compress
better, but key/value is only one kind of structure. An Avro-specific
compressor could compress records as columns. If we want to enable
structure-specific compression in the shuffle, then we should perhaps make that
pluggable.
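To make that concrete, here's a rough sketch of the sort of hook I have in mind; the interface and method names are made up for illustration, nothing like this exists in the codebase today:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical hook for structure-specific shuffle compression.  An
// Avro-specific implementation could regroup record fields into columns
// before compressing; the default could just delegate to a plain codec.
public interface ShuffleCompressor {
  /** Wrap the map-side shuffle output stream. */
  OutputStream wrapOutput(OutputStream rawShuffleOutput) throws IOException;

  /** Wrap the reduce-side shuffle input stream. */
  InputStream wrapInput(InputStream rawShuffleInput) throws IOException;
}
{code}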
> 2. combining the keys and values will effectively block using memcmp for the
> sort
Memcmp-friendly keys and values are a data format. If a job uses memcmp for
comparison, then it could include a key-length of each datum, perhaps as a
varint tacked on the front. If a job uses Avro for comparison, then it needs
to include a schema, but distinguishing keys and values is not required. So,
depending on how comparison is done, the requirements are different. As I said
above, comparators might be a pluggable part of the kernel, like device drivers
in the Linux kernel.
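For example, a raw datum might be framed and compared like this, using the existing Writable utilities (the class name is hypothetical):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Hypothetical framing: each datum is varint(key length) + key bytes +
// value bytes, so a raw comparator can memcmp the key portion alone.
public class VIntFramedDatum {

  public static byte[] encode(byte[] key, byte[] value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    WritableUtils.writeVInt(out, key.length); // key-length tacked on the front
    out.write(key);
    out.write(value);
    out.flush();
    return buffer.toByteArray();
  }

  public static int compare(byte[] d1, byte[] d2) throws IOException {
    int keyLen1 = WritableComparator.readVInt(d1, 0);
    int keyStart1 = WritableUtils.decodeVIntSize(d1[0]);
    int keyLen2 = WritableComparator.readVInt(d2, 0);
    int keyStart2 = WritableUtils.decodeVIntSize(d2[0]);
    // Lexicographic (memcmp-style) comparison of the key bytes only.
    return WritableComparator.compareBytes(d1, keyStart1, keyLen1,
                                           d2, keyStart2, keyLen2);
  }
}
{code}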
> 3. We could support more flexible memory management if the values can be
> distinguished from the keys.
I don't see how this would work in practice.
> 4. Since SequenceFiles and TFiles contain both keys and values, they would
> need to be wrapped together to present them as a single object.
That would be done as a part of the Java compatibility API. Their
serializations could just be appended, I think.
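E.g., something like the following (class name made up):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical compatibility shim: present a (key, value) pair read from a
// SequenceFile or TFile as one datum by appending their serializations.
public class AppendedPair {
  public static byte[] serialize(Writable key, Writable value)
      throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    key.write(out);   // the key's serialization first...
    value.write(out); // ...with the value's appended directly after
    out.flush();
    return buffer.toByteArray();
  }
}
{code}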
> I don't see what this abstraction is buying you over using ByteBuffer and a
> Serializer that knows how to use it.
You're suggesting that, instead of adding a new low-level API, we should add a
new high-level mapreduce API based on BytesWritable? We'd, e.g., serialize
Avro data into a BytesWritable key and use NullWritable for the value. The
comparator could get the schema from the job. Then we'd have a new
Avro-specific high-level framework for mapreduce, with its own Mapper and
Reducer interfaces. The Avro mapreduce framework would serialize map outputs
to a BytesWritable, then pass them on through the existing mapreduce API. This
adds another layer of buffering, but, other than that, it could work. We could
even make it generic to arbitrary serialization engines, not just Avro. Is
this what you'd prefer that mapreduce applications using Avro do?
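For concreteness, the glue might look something like this, assuming current Avro APIs; the class name and configuration key are made up, though Avro's BinaryData.compare really can order serialized records without deserializing them:
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;

// Hypothetical glue for the layered approach: Avro map outputs are
// serialized into a BytesWritable key (with NullWritable as the value)
// and passed through the existing mapreduce API unchanged.
public class AvroBytesBridge {

  // The extra layer of buffering mentioned above.
  public static BytesWritable toKey(GenericRecord record, Schema schema)
      throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(buffer, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    return new BytesWritable(buffer.toByteArray());
  }

  // The comparator recovers the schema from the job configuration;
  // "avro.map.output.schema" is an illustrative, made-up key.
  public static Schema schemaFromJob(Configuration conf) {
    return new Schema.Parser().parse(conf.get("avro.map.output.schema"));
  }

  // Order two serialized keys without deserializing them.
  public static int compareRaw(BytesWritable a, BytesWritable b,
                               Schema schema) {
    return BinaryData.compare(a.getBytes(), 0, b.getBytes(), 0, schema);
  }
}
{code}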
My primary goal here is taking steps towards language independence. It is
unfortunate that the lowest-level MapReduce APIs are built around high-level,
language-specific concepts like Java classes and generics. Rather, we might
work to hone the mapreduce kernel to a minimal, language-independent core, then
add a variety of higher-level APIs in different languages.
> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
> Key: MAPREDUCE-326
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: eric baldeschwieler
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers, and other complexities that allow map-reduce to
> use arbitrary types complicate the design and lead to lots of object creation
> and other overhead that a byte-oriented design would not suffer. I believe
> the lowest-level implementation of hadoop map-reduce should have byte-string
> oriented APIs (for keys and values). This API would be more performant,
> simpler, and more easily made cross-language.
> The existing API could be maintained as a thin layer on top of the leaner API.