Re: Batching key/value pairs to map
I've got a program that starts by hooking into a legacy system to look up its maps out of a db, and the keys are sparse. I'm sure there might be another way to do this, but this was by far the easiest/simplest solution.

Jimmy Wan

On Mon, Feb 23, 2009 at 19:39, Edward Capriolo wrote:
> We have a MR program that collects once for each token on a line. What
> types of applications can benefit from batch mapping?
Re: Batching key/value pairs to map
We have a MR program that collects once for each token on a line. What types of applications can benefit from batch mapping?
Re: Batching key/value pairs to map
On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote:
> I'm not sure if this is possible, but it would certainly be nice to either:
> 1) pass the OutputCollector and Reporter to the close() method.
> 2) Provide accessors to the OutputCollector and the Reporter.

If you look at the 0.20 branch, which hasn't been released yet, there is a new map/reduce API. That API provides a lot more control. Take a look at Mapper, which provides setup, map, and cleanup hooks: http://tinyurl.com/bquvxq

The map method looks like:

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

But there is also a run method that drives the task. The default is given below, but it can be overridden by the application.

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }

Clearly, in your application you could override run to make a list of 100 key/value pairs or something.

-- Owen
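[Editor's note: the batch-collecting run() loop Owen suggests can be sketched outside Hadoop, since the batching logic itself is plain Java. The BatchingDriver class, its names, and the record type below are hypothetical stand-ins, not Hadoop API; in a real mapper the loop body would be the nextKeyValue()/getCurrentKey()/getCurrentValue() calls shown above.]

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the pattern Owen describes: a run-style driver
// that accumulates records into fixed-size batches, processes each full
// batch, and flushes the final partial batch after the loop ends.
// Everything here (class and method names) is illustrative, not Hadoop API.
public class BatchingDriver {
    static final int BATCH_SIZE = 3;

    // Stand-in for batchMap(): here we just record each batch's size.
    static final List<Integer> batchSizes = new ArrayList<>();

    static void processBatch(List<String> batch) {
        batchSizes.add(batch.size());
        batch.clear();
    }

    public static void run(Iterable<String> records) {
        List<String> batch = new ArrayList<>();
        for (String r : records) {
            batch.add(r);                 // in Hadoop, clone the reused object first
            if (batch.size() == BATCH_SIZE) {
                processBatch(batch);
            }
        }
        if (!batch.isEmpty()) {           // flush the leftover partial batch
            processBatch(batch);
        }
    }

    public static void main(String[] args) {
        run(List.of("a", "b", "c", "d", "e", "f", "g"));
        System.out.println(batchSizes);   // batches of 3, 3, and a final 1
    }
}
```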
Re: Batching key/value pairs to map
Great, thanks Owen. I actually ran into the object reuse problem a long time ago. The output of my MR processes gets turned into a series of large INSERT statements that weren't performing well unless I batched them into inserts of several thousand entries.

I'm not sure if this is possible, but it would certainly be nice to either:
1) pass the OutputCollector and Reporter to the close() method.
2) provide accessors to the OutputCollector and the Reporter.

Now every single one of my maps is going to have one or two extra no-ops. I'll check to see if that's on the list of outstanding FRs.

On Mon, Feb 23, 2009 at 15:30, Owen O'Malley wrote:
> On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan wrote:
>> part of my map/reduce process could be greatly sped up by mapping
>> key/value pairs in batches instead of mapping them one by one.
>> Can I safely hang onto my OutputCollector and Reporter from calls to map?
>
> Yes. You can even use them in the close, so that you can process the last
> batch of records. *smile* One problem that you will quickly hit is that
> Hadoop reuses the objects that are passed to map and reduce. So, you'll need
> to clone them before putting them into the collection.
>
>> I'm currently running Hadoop 0.17.2.1. Is this something I could do in
>> Hadoop 0.19.X?
>
> I don't think any of this changed between 0.17 and 0.19, other than in 0.17
> the reduce's inputs were always new objects. In 0.18 and after, the reduce's
> inputs are reused.
Re: Batching key/value pairs to map
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan wrote:
> part of my map/reduce process could be greatly sped up by mapping
> key/value pairs in batches instead of mapping them one by one.
> Can I safely hang onto my OutputCollector and Reporter from calls to map?

Yes. You can even use them in the close, so that you can process the last batch of records. *smile* One problem that you will quickly hit is that Hadoop reuses the objects that are passed to map and reduce. So, you'll need to clone them before putting them into the collection.

> I'm currently running Hadoop 0.17.2.1. Is this something I could do in
> Hadoop 0.19.X?

I don't think any of this changed between 0.17 and 0.19, other than in 0.17 the reduce's inputs were always new objects. In 0.18 and after, the reduce's inputs are reused.

-- Owen
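[Editor's note: the reuse pitfall Owen warns about can be shown in plain Java. The StringBuilder below is a hypothetical stand-in for a reused Writable; in real Hadoop code you would make a copy of each key/value before storing it, e.g. with WritableUtils.clone.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates why storing the object passed to map() is a bug when the
// framework reuses it: every stored reference aliases the same object,
// so all entries end up showing whatever the last record was.
public class ReusePitfall {
    public static void main(String[] args) {
        StringBuilder reused = new StringBuilder();    // stand-in for a reused Writable
        List<StringBuilder> byReference = new ArrayList<>();
        List<String> byCopy = new ArrayList<>();

        for (String record : new String[] {"alpha", "beta"}) {
            reused.setLength(0);
            reused.append(record);         // "framework" overwrites the same object
            byReference.add(reused);       // BUG: aliases `reused` every time
            byCopy.add(reused.toString()); // OK: snapshot the current contents
        }

        System.out.println(byReference);   // both entries show the last record
        System.out.println(byCopy);        // copies preserve each record
    }
}
```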
Batching key/value pairs to map
Part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. I'd like to do the following:

  protected abstract void batchMap(OutputCollector k2V2OutputCollector,
                                   Reporter reporter) throws IOException;

  public void map(K1 key1, V1 value1, OutputCollector output,
                  Reporter reporter) throws IOException {
    keys.add(key1.copy());
    values.add(value1.copy());
    if (++currentSize == batchSize) {
      batchMap(output, reporter);
      clear();
    }
  }

  public void close() throws IOException {
    if (currentSize > 0) {
      // I don't have access to my OutputCollector or Reporter here!
      batchMap(output, reporter);
      clear();
    }
  }

Can I safely hang onto my OutputCollector and Reporter from calls to map?

I'm currently running Hadoop 0.17.2.1. Is this something I could do in Hadoop 0.19.X?
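[Editor's note: the workaround Owen confirms upthread — stash the OutputCollector passed to map() in a field and use it in close() to flush the final partial batch — can be sketched as follows. The OutputCollector interface here is a simplified stand-in for Hadoop's, so the sketch compiles without the Hadoop jars; field and method names are illustrative.]

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of a batching mapper for the old (pre-0.20) API: map() saves
// the collector reference, buffers copies of the records, and emits a
// batch whenever the buffer fills; close() flushes the leftovers.
public class BatchingMapper {
    // Simplified stand-in for org.apache.hadoop.mapred.OutputCollector.
    interface OutputCollector<K, V> {
        void collect(K key, V value) throws IOException;
    }

    static final int BATCH_SIZE = 2;
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    private OutputCollector<String, String> savedOutput; // held across map() calls

    public void map(String key, String value,
                    OutputCollector<String, String> output) throws IOException {
        savedOutput = output;    // per Owen, safe to keep for the whole task
        keys.add(key);           // with real Writables, clone before storing
        values.add(value);
        if (keys.size() == BATCH_SIZE) {
            flush();
        }
    }

    public void close() throws IOException {
        if (!keys.isEmpty()) {   // emit the final partial batch
            flush();
        }
    }

    private void flush() throws IOException {
        for (int i = 0; i < keys.size(); i++) {
            savedOutput.collect(keys.get(i), values.get(i));
        }
        keys.clear();
        values.clear();
    }
}
```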