[
https://issues.apache.org/jira/browse/HADOOP-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617901#action_12617901
]
Owen O'Malley commented on HADOOP-1230:
---------------------------------------
{quote}
1. What is the contract for cleanup()? Is it called if map()/reduce() throws an
exception? I think it should be, so Mapper/Reducer#run should call cleanup() in
a finally clause.
{quote}
Currently, it is just:
{code}
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  KEYIN key = context.nextKey(null);
  VALUEIN value = null;
  while (key != null) {
    value = context.nextValue(value);
    map(key, value, context);
    key = context.nextKey(key);
  }
  cleanup(context);
}
{code}
I thought about it, but it seemed to confuse things more than it helped. I
guess it mostly depends on whether cleanup is used to close file handles, which
should still run after an exception, or to process the last record, which
shouldn't. Of course, by overriding the run method, the user can do either.
What are other people's thoughts?
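For concreteness, here is roughly what the run() loop would look like if
cleanup() were moved into a finally clause; this is just a sketch of the
alternative being discussed, not what the current patch does:
{code}
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    KEYIN key = context.nextKey(null);
    VALUEIN value = null;
    while (key != null) {
      value = context.nextValue(value);
      map(key, value, context);
      key = context.nextKey(key);
    }
  } finally {
    // Runs even if map() throws, so file handles get closed; but it also
    // runs on failure, which is wrong if cleanup() processes a last record.
    cleanup(context);
  }
}
{code}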
{quote}
2. One of the things that the previous version supported was a flexible way of
handling large value classes. If your value is huge you may not want to
deserialize it into an object, but instead read the byte stream directly. This
isn't a part of this issue, but I think the current approach will support it by
i) adding streaming accessors to the context, ii) overriding the run() method
to pass in a null value, so map()/reduce() implementations get the value byte
stream from the context. (More generally, this might be the approach to support
HADOOP-2429.) Does this sound right?
{quote}
The problem that I have is that it would need to bypass the RecordReader to do
it. If you add to the context
{code}
InputStream getKey() throws IOException;
InputStream getValue() throws IOException;
{code}
you need to add a parallel method in RecordReader to get raw keys. And
presumably the same trick in the RecordWriter for output. On the other hand, a
lazy value class with a file-backed implementation could work with the object
interface. Am I missing how this would work?
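To illustrate the lazy-value alternative: a file-backed value class along these
lines could hand raw bytes to the map() body without touching the RecordReader
interface. The class name and record layout below are hypothetical, not part of
any patch.
{code}
import java.io.*;
import org.apache.hadoop.io.Writable;

/**
 * Hypothetical file-backed value: readFields() spills the record's bytes to a
 * local temp file instead of building an in-memory object, and the map() body
 * can later open a stream over them. Illustrative only.
 */
public class FileBackedValue implements Writable {
  private File spill;

  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();               // assumes a length-prefixed record
    spill = File.createTempFile("value", ".bin");
    OutputStream out = new FileOutputStream(spill);
    try {
      byte[] buf = new byte[8192];
      for (int remaining = len; remaining > 0; ) {
        int n = Math.min(buf.length, remaining);
        in.readFully(buf, 0, n);
        out.write(buf, 0, n);
        remaining -= n;
      }
    } finally {
      out.close();
    }
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt((int) spill.length());
    InputStream in = new FileInputStream(spill);
    try {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) > 0; ) {
        out.write(buf, 0, n);
      }
    } finally {
      in.close();
    }
  }

  /** The map() body reads the raw bytes directly, never deserializing them. */
  public InputStream getRawBytes() throws IOException {
    return new FileInputStream(spill);
  }
}
{code}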
{quote}
3. ReduceContext could be made to implement Iterable<VALUEIN>, to make it
slightly more concise to iterate over the values (for expert use in the run
method). The reduce method would be unchanged.
{quote}
It is a pretty minor improvement of
{code}
for(VALUE v: context)
{code}
versus
{code}
for(VALUE v: context.getValues())
{code}
and means that the ReduceContext needs an iterator() method that is relatively
ambiguous between iterating over keys or values. I think the current explicit
method makes it cleaner.
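As a sketch of why the iterator() method would read ambiguously (the signatures
below are illustrative, not taken from the patch):
{code}
import java.util.Iterator;

public abstract class ReduceContext<KEYIN, VALUEIN> implements Iterable<VALUEIN> {
  /** With option 3 this must exist, but "iterate over what?" is unclear from
      the type alone: keys, or values of the current key? */
  public abstract Iterator<VALUEIN> iterator();

  /** The explicit accessor kept in the patch leaves no such ambiguity. */
  public abstract Iterable<VALUEIN> getValues();
}
{code}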
{quote}
4. Although not a hard requirement, it would be nice to make the user API
serialization agnostic. I think we can make InputSplit not implement Writable,
and use a SerializationFactory to serialize splits.
{quote}
This makes sense.
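A rough sketch of what that could look like on the split-writing side, using
the existing SerializationFactory; the wrapper class, method name, and the
assumption that InputSplit lives in the new org.apache.hadoop.mapreduce package
are made up for illustration:
{code}
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;
import org.apache.hadoop.mapreduce.InputSplit;

public class SplitWriterSketch {
  // Writes one split using whatever serialization is registered for its
  // concrete class, so InputSplit itself never has to implement Writable.
  @SuppressWarnings("unchecked")
  static void writeSplit(Configuration conf, InputSplit split,
                         DataOutputStream out) throws IOException {
    SerializationFactory factory = new SerializationFactory(conf);
    Serializer<InputSplit> serializer = (Serializer<InputSplit>)
        factory.getSerializer((Class<InputSplit>) split.getClass());
    serializer.open(out);
    serializer.serialize(split);
    serializer.close();
  }
}
{code}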
{quote}
5. Is this a good opportunity to make TextInputFormat extend
FileInputFormat<Text, NullWritable>, like HADOOP-3566?
{quote}
*smile* It probably makes sense, although I'm a little hesitant to break yet
another thing.
{quote}
6. JobContext#getGroupingComparator has javadoc that refers to
WritableComparable, when it should be RawComparator.
{quote}
+1
> Replace parameters with context objects in Mapper, Reducer, Partitioner,
> InputFormat, and OutputFormat classes
> --------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1230
> URL: https://issues.apache.org/jira/browse/HADOOP-1230
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: context-objs-2.patch, context-objs-3.patch,
> context-objs.patch
>
>
> This is a big change, but it will future-proof our API's. To maintain
> backwards compatibility, I'd suggest that we move over to a new package name
> (org.apache.hadoop.mapreduce) and deprecate the old interfaces and package.
> Basically, it will replace:
> package org.apache.hadoop.mapred;
> public interface Mapper extends JobConfigurable, Closeable {
>   void map(WritableComparable key, Writable value, OutputCollector output,
>            Reporter reporter) throws IOException;
> }
> with:
> package org.apache.hadoop.mapreduce;
> public interface Mapper extends Closeable {
>   void map(MapContext context) throws IOException;
> }
> where MapContext has methods like getKey(), getValue(), collect(Key, Value),
> progress(), etc.