[ https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829838#action_12829838 ]
Doug Cutting commented on MAPREDUCE-326:
----------------------------------------

> 1. keys compress better separately than intermixed with values (not that we
> do that currently...)

Sure, if you know something about the structure then you can usually compress better, but key/value is only one kind of structure. An Avro-specific compressor could compress records as columns. If we want to enable structure-specific compression in the shuffle, then we should perhaps make that pluggable.

> 2. combining the keys and values will effectively block using memcmp for the
> sort

Memcmp-friendly keys and values are a data format. If a job uses memcmp for comparison, then it could include a key length in each datum, perhaps as a varint tacked on the front. If a job uses Avro for comparison, then it needs to include a schema, but distinguishing keys from values is not required. So, depending on how comparison is done, the requirements differ. As I said above, comparators might be a pluggable part of the kernel, like device drivers in the Linux kernel.

> 3. We could support more flexible memory management if the values can be
> distinguished from the keys.

I don't see how this would work in practice.

> 4. Since Sequence Files and T Files contain both Keys and Values, they would
> need to be wrapped together to present them as a single object.

That would be done as part of the Java compatibility API. Their serializations could just be appended, I think.

> I don't see what this abstraction is buying you over using ByteBuffer and a
> Serializer that knows how to use it.

You're suggesting that, instead of adding a new low-level API, we should add a new high-level mapreduce API based on BytesWritable? We'd, e.g., serialize Avro data into a BytesWritable key and use NullWritable for the value. The comparator could get the schema from the job. Then we'd have a new Avro-specific high-level framework for mapreduce, with its own Mapper and Reducer interfaces.
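As a minimal sketch of the varint-prefixed layout mentioned under point 2 — a key length tacked on the front of each datum so a memcmp-style comparator can sort without understanding the values — here is one possible encoding in plain Java. The class and method names are illustrative only; none of this is Hadoop API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: each datum is varint(keyLength) + key bytes + value
// bytes, so a byte-oriented sort can compare keys without a type-aware
// deserializer. Names are hypothetical, not part of Hadoop.
public class VarIntKeys {

    // Write an unsigned varint: 7 bits per byte, high bit means "more follows".
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Decode the varint at offset; returns {value, bytesConsumed}.
    static int[] readVarInt(byte[] buf, int offset) {
        int value = 0, shift = 0, pos = offset;
        while (true) {
            int b = buf[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return new int[] { value, pos - offset };
    }

    // Encode one datum as varint(keyLength) + key + value.
    static byte[] encode(byte[] key, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(out, key.length);
        out.write(key, 0, key.length);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    // memcmp-style comparison restricted to the key portion of each datum.
    static int compareKeys(byte[] a, byte[] b) {
        int[] la = readVarInt(a, 0), lb = readVarInt(b, 0);
        int offA = la[1], offB = lb[1];
        int n = Math.min(la[0], lb[0]);
        for (int i = 0; i < n; i++) {
            int d = (a[offA + i] & 0xFF) - (b[offB + i] & 0xFF);
            if (d != 0) return d;
        }
        return la[0] - lb[0];  // shorter key sorts first on a shared prefix
    }
}
```

A job that opts into memcmp sorting would emit data in this shape; a job that compares with Avro instead would skip the length prefix and carry a schema, per the comment above.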
The Avro mapreduce framework would serialize map outputs to a BytesWritable, then pass them on through the existing mapreduce API. This adds another layer of buffering but, other than that, it could work. We could even make it generic over arbitrary serialization engines, not just Avro. Is this what you'd prefer mapreduce applications that use Avro do?

My primary goal here is to make steps towards language independence. It is unfortunate that the lowest-level MapReduce APIs are built around high-level, language-specific concepts like Java classes and generics. Rather, we might work to hone the mapreduce kernel to a minimal, language-independent core, then add a variety of higher-level APIs in different languages.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to
> use arbitrary types complicate the design and lead to lots of object creates
> and other overhead that a byte oriented design would not suffer. I believe
> the lowest level implementation of hadoop map-reduce should have byte string
> oriented APIs (for keys and values). This API would be more performant,
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
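The layering the comment argues for — a bytes-only kernel underneath, with thin language-specific serialization shims on top — can be sketched in a few lines of plain Java. Every interface and method name below is hypothetical; this illustrates the proposed shape, not any existing Hadoop interface.

```java
// Hypothetical sketch of a byte-oriented mapreduce kernel boundary.
// The kernel only ever sees byte strings; typed records are serialized
// by a higher-level, language-specific layer before crossing it.
public class ByteKernel {

    // The minimal, language-independent core interface: bytes in.
    interface ByteCollector {
        void collect(byte[] datum);
    }

    // Supplied by a higher-level API (Java, Avro, etc.) for some record type T.
    interface Serializer<T> {
        byte[] serialize(T record);
    }

    // The "extra layer of buffering" from the comment: serialize the typed
    // record, then hand the resulting byte string down to the kernel.
    static <T> void emit(T record, Serializer<T> serializer, ByteCollector kernel) {
        kernel.collect(serializer.serialize(record));
    }
}
```

A Java-compatibility layer would plug a Writable-based serializer into `Serializer<T>`; an Avro layer would plug in Avro's binary encoder; the kernel stays ignorant of both.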