Re: Batching key/value pairs to map

2009-02-24 Thread Jimmy Wan
I've got a program that starts by hooking into a legacy system to look
up its maps out of a db, and the keys are sparse. I'm sure there might
be another way to do this, but this was by far the easiest/simplest
solution.

Jimmy Wan

On Mon, Feb 23, 2009 at 19:39, Edward Capriolo  wrote:
> We have a MR program that collects once for each token on a line. What
> types of applications can benefit from batch mapping?


Re: Batching key/value pairs to map

2009-02-23 Thread Edward Capriolo
We have a MR program that collects once for each token on a line. What
types of applications can benefit from batch mapping?


Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley


On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote:


> I'm not sure if this is possible, but it would certainly be nice to either:
> 1) pass the OutputCollector and Reporter to the close() method.
> 2) Provide accessors to the OutputCollector and the Reporter.


If you look at the 0.20 branch, which hasn't been released yet, there is a
new map/reduce api. That api does provide a lot more control. Take a
look at Mapper, which provides setup, map, and cleanup hooks:


http://tinyurl.com/bquvxq

The map method looks like:

  /**
   * Called once for each key/value pair in the input split. Most
   * applications should override this, but the default is the identity
   * function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

But there is also a run method that drives the task. The default is  
given below, but it can be overridden by the application.


  /**
   * Expert users can override this method for more complete control over
   * the execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }

Clearly, in your application you could override run to build up a list of
100 key/value pairs or something.
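For example, here is a minimal sketch of such a batching run override.
BatchingMapper, BATCH_SIZE, processBatch, and the concrete key/value
types are my own illustration, not part of the api:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchingMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int BATCH_SIZE = 100;
  private final List<LongWritable> keys = new ArrayList<LongWritable>();
  private final List<Text> values = new ArrayList<Text>();

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      // the framework reuses the key/value objects, so copy before buffering
      keys.add(new LongWritable(context.getCurrentKey().get()));
      values.add(new Text(context.getCurrentValue()));
      if (keys.size() == BATCH_SIZE) {
        processBatch(context);
      }
    }
    processBatch(context);  // flush the final partial batch
    cleanup(context);
  }

  // placeholder for whatever per-batch work the application needs
  private void processBatch(Context context)
      throws IOException, InterruptedException {
    for (int i = 0; i < keys.size(); i++) {
      context.write(new Text(keys.get(i).toString()), new IntWritable(1));
    }
    keys.clear();
    values.clear();
  }
}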


-- Owen


Re: Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
Great, thanks Owen. I actually ran into the object reuse problem a
long time ago. The output of my MR processes gets turned into a series
of large INSERT statements that didn't perform well unless I batched
them into inserts of several thousand entries (roughly the sketch
below). I'm not sure if this is possible, but it would certainly be
nice to either:
1) pass the OutputCollector and Reporter to the close() method, or
2) provide accessors to the OutputCollector and the Reporter.

Now every single one of my map calls is going to carry 1-2 extra no-ops.
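For reference, the database side is just the standard JDBC batch api. A
rough sketch, with the table, column names, and helper invented for
illustration (Connection and PreparedStatement are java.sql types):

static void flushInserts(Connection conn, Map<String, Long> rows)
    throws SQLException {
  PreparedStatement stmt =
      conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)");
  try {
    for (Map.Entry<String, Long> e : rows.entrySet()) {
      stmt.setString(1, e.getKey());
      stmt.setLong(2, e.getValue());
      stmt.addBatch();    // buffer the row client-side
    }
    stmt.executeBatch();  // one round trip for the whole batch
  } finally {
    stmt.close();
  }
}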

I'll check to see if that's on the list of outstanding FRs.

On Mon, Feb 23, 2009 at 15:30, Owen O'Malley  wrote:
> On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan  wrote:
>
>> part of my map/reduce process could be greatly sped up by mapping
>> key/value pairs in batches instead of mapping them one by one.
>> Can I safely hang onto my OutputCollector and Reporter from calls to map?
>
> Yes. You can even use them in the close, so that you can process the last
> batch of records. *smile* One problem that you will quickly hit is that
> Hadoop reuses the objects that are passed to map and reduce. So, you'll need
> to clone them before putting them into the collection.
>
>> I'm currently running Hadoop 0.17.2.1. Is this something I could do in
>> Hadoop 0.19.X?
>
> I don't think any of this changed between 0.17 and 0.19, other than in 0.17
> the reduce's inputs were always new objects. In 0.18 and after, the reduce's
> inputs are reused.


Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan  wrote:

> part of my map/reduce process could be greatly sped up by mapping
> key/value pairs in batches instead of mapping them one by one.
> Can I safely hang onto my OutputCollector and Reporter from calls to map?


Yes. You can even use them in the close, so that you can process the last
batch of records. *smile* One problem that you will quickly hit is that
Hadoop reuses the objects that are passed to map and reduce. So, you'll need
to clone them before putting them into the collection.
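A minimal sketch of that pattern, assuming the old (mapred) api; the
class, the field names, and the batch size are my own invention.
WritableUtils.clone makes a deep copy through serialization, which is
why the JobConf is kept around:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BufferingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private final List<Text> batch = new ArrayList<Text>();
  private OutputCollector<Text, LongWritable> output;  // cached for close()
  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    this.output = output;  // safe to hang onto, per the advice above
    batch.add(WritableUtils.clone(value, conf));  // clone: objects are reused
    if (batch.size() == 1000) {
      flush();
    }
  }

  public void close() throws IOException {
    if (!batch.isEmpty()) {
      flush();  // the cached collector is still usable here
    }
  }

  private void flush() throws IOException {
    for (Text t : batch) {
      output.collect(t, new LongWritable(1L));
    }
    batch.clear();
  }
}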

> I'm currently running Hadoop 0.17.2.1. Is this something I could do in
> Hadoop 0.19.X?

I don't think any of this changed between 0.17 and 0.19, other than in 0.17
the reduce's inputs were always new objects. In 0.18 and after, the reduce's
inputs are reused.

-- Owen


Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
part of my map/reduce process could be greatly sped up by mapping
key/value pairs in batches instead of mapping them one by one. I'd
like to do the following:
protected abstract void batchMap(OutputCollector<K2, V2> k2V2OutputCollector,
                                 Reporter reporter) throws IOException;

public void map(K1 key1, V1 value1, OutputCollector<K2, V2> output,
                Reporter reporter) throws IOException {
    keys.add(key1.copy());
    values.add(value1.copy());
    if (++currentSize == batchSize) {
        batchMap(output, reporter);
        clear();
    }
}

public void close() throws IOException {
    if (currentSize > 0) {
        // I don't have access to my OutputCollector or Reporter here!
        batchMap(output, reporter);
        clear();
    }
}

Can I safely hang onto my OutputCollector and Reporter from calls to map?

I'm currently running Hadoop 0.17.2.1. Is this something I could do in
Hadoop 0.19.X?