[ https://issues.apache.org/jira/browse/HADOOP-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617901#action_12617901 ]

Owen O'Malley commented on HADOOP-1230:
---------------------------------------

{quote}
1. What is the contract for cleanup()? Is it called if map()/reduce() throws an 
exception? I think it should be, so Mapper/Reducer#run should call cleanup() in 
a finally clause.
{quote}

Currently, it is just:
{code}
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    KEYIN key = context.nextKey(null);
    VALUEIN value = null;
    while (key != null) {
      value = context.nextValue(value);
      map(key, value, context);
      key = context.nextKey(key);
    }
    cleanup(context);   // note: not reached if map() or the context calls throw
  }
{code}

I thought about it, but it seemed to confuse things more than it helped. I 
guess it mostly depends on whether cleanup is used to close file handles, which 
should still happen, or to process the last record, which shouldn't happen. Of 
course, by overriding the run method, the user can get either behavior. What 
are other people's thoughts?
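
For concreteness, a minimal sketch of the finally-based variant (which a user 
could also get today by overriding run themselves), written against the draft 
API shown above rather than any committed code:
{code}
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      KEYIN key = context.nextKey(null);
      VALUEIN value = null;
      while (key != null) {
        value = context.nextValue(value);
        map(key, value, context);
        key = context.nextKey(key);
      }
    } finally {
      cleanup(context);   // now runs even if map() or the context calls throw
    }
  }
{code}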

{quote}
2. One of the things that the previous version supported was a flexible way of 
handling large value classes. If your value is huge you may not want to 
deserialize it into an object, but instead read the byte stream directly. This 
isn't part of this issue, but I think the current approach will support it by 
i) adding streaming accessors to the context, ii) overriding the run() method 
to pass in a null value, so map()/reduce() implementations get the value byte 
stream from the context. (More generally, this might be the approach to support 
HADOOP-2429.) Does this sound right?
{quote}

The problem that I have is that it would need to bypass the RecordReader to do 
it. If you add to the context
{code}
InputStream getKey() throws IOException;
InputStream getValue() throws IOException;
{code}
you need to add a parallel method in RecordReader to get raw keys, and 
presumably the same trick in the RecordWriter for output. On the other hand, a 
lazy value class with a file-backed implementation could work with the object 
interface. Am I missing how this would work?
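
To make that second option concrete, here is a rough sketch of a file-backed 
lazy value; the class name and the spill-file handling are purely illustrative 
assumptions, not part of the patch:
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;

/** Hypothetical value class that keeps large payloads on local disk and
 *  hands out a stream on demand instead of deserializing into memory. */
public class LazyBytesWritable implements Writable {
  private Configuration conf;
  private Path backingFile;   // where the bytes were spilled
  private long offset;
  private long length;

  /** Open a stream over the backing bytes; the caller reads up to length bytes. */
  public InputStream open() throws IOException {
    FileSystem fs = FileSystem.getLocal(conf);
    FSDataInputStream in = fs.open(backingFile);
    in.seek(offset);
    return in;
  }

  public void write(DataOutput out) throws IOException {
    // would copy the backing bytes to 'out' without buffering them in memory
  }

  public void readFields(DataInput in) throws IOException {
    // would spill the incoming bytes to 'backingFile' rather than holding them
  }
}
{code}
This keeps the existing RecordReader/RecordWriter and Context interfaces 
unchanged; only the value class knows the bytes live on disk.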

{quote}
3. ReduceContext could be made to implement Iterable<VALUEIN>, to make it 
slightly more concise to iterate over the values (for expert use in the run 
method). The reduce method would be unchanged.
{quote}

It is a pretty minor improvement of
{code}
for(VALUE v: context)
- versus -
for(VALUE v: context.getValues())
{code}
and it means that ReduceContext would need an iterator() method that is 
ambiguous about whether it iterates over keys or values. I think the current 
explicit method is cleaner.
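
For what it's worth, the explicit form in an expert run() override would look 
roughly like this; the nextKey()/getValues() accessor names are assumptions 
based on the draft Mapper#run above, not the final API:
{code}
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    KEYIN key = context.nextKey(null);
    while (key != null) {
      for (VALUEIN value : context.getValues()) {
        // expert per-value handling here instead of calling reduce()
      }
      key = context.nextKey(key);
    }
    cleanup(context);
  }
{code}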

{quote}
4. Although not a hard requirement, it would be nice to make the user API 
serialization agnostic. I think we can make InputSplit not implement Writable, 
and use a SerializationFactory to serialize splits.
{quote}

This makes sense.
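
To sketch what that might look like, a split could be written through the 
existing SerializationFactory; the surrounding plumbing (recording the split 
class name, where the bytes end up) is assumed here, not taken from the patch:
{code}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

class SplitWriterSketch {
  @SuppressWarnings("unchecked")
  static <T> void writeSplit(Configuration conf, T split, OutputStream out)
      throws IOException {
    SerializationFactory factory = new SerializationFactory(conf);
    Serializer<T> serializer =
        (Serializer<T>) factory.getSerializer(split.getClass());
    serializer.open(out);
    serializer.serialize(split);   // no Writable requirement on the split type
    serializer.close();
  }
}
{code}
Reading the split back would go through the matching Deserializer in the same 
way.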

{quote}
5. Is this a good opportunity to make TextInputFormat extend 
FileInputFormat<Text, NullWritable>, like HADOOP-3566?
{quote}

*smile* It probably makes sense, although I'm a little hesitant to break yet 
another thing.

{quote}
6. JobContext#getGroupingComparator has javadoc that refers to 
WritableComparable, when it should be RawComparator.
{quote}

+1

> Replace parameters with context objects in Mapper, Reducer, Partitioner, 
> InputFormat, and OutputFormat classes
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1230
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1230
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: context-objs-2.patch, context-objs-3.patch, 
> context-objs.patch
>
>
> This is a big change, but it will future-proof our API's. To maintain 
> backwards compatibility, I'd suggest that we move over to a new package name 
> (org.apache.hadoop.mapreduce) and deprecate the old interfaces and package. 
> Basically, it will replace:
> package org.apache.hadoop.mapred;
> public interface Mapper extends JobConfigurable, Closeable {
>   void map(WritableComparable key, Writable value, OutputCollector output, 
> Reporter reporter) throws IOException;
> }
> with:
> package org.apache.hadoop.mapreduce;
> public interface Mapper extends Closeable {
>   void map(MapContext context) throws IOException;
> }
> where MapContext has methods like getKey(), getValue(), collect(Key, 
> Value), progress(), etc.
