Hello,

One quick win could be to change this line:

context.write(word, new IntWritable(val));

Instead of instantiating an IntWritable on each iteration, instantiate it once 
as a member variable (like what you’ve already done for word) and call 
IntWritable#set on each iteration.  This might save some object allocation and 
garbage collection churn.
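
For example, a quick sketch of that change inside your WordCountIdentityMapper (the count field name is just illustrative):

// Reuse a single IntWritable instance instead of allocating one per record.
private final IntWritable count = new IntWritable();

public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    word.set(itr.nextToken());
    // parseInt also avoids boxing an Integer per record, unlike Integer#valueOf.
    count.set(Integer.parseInt(itr.nextToken()));
    context.write(word, count);
}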

Beyond that, I’d recommend either profiling or looking at the overall workflow 
to see if something can be changed at a higher level.  You mentioned that this 
is consuming the output of the wordcount example job.  Perhaps you can change 
the wordcount job’s code to write a more efficient representation of the data 
for your use case.  For example, if the counts were stored as binary integers 
directly, then the second job wouldn’t have to pay the cost of re-parsing them 
from strings by calling Integer#valueOf.
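
For example, assuming you control both drivers, a sketch like this would write the pairs as a binary SequenceFile that the second job can read back directly (job and job2 here stand for whatever Job instances your drivers configure):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// In the wordcount driver: store Text/IntWritable pairs in binary form.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// In the second job's driver: read the pairs back without string parsing.
job2.setInputFormatClass(SequenceFileInputFormat.class);

With SequenceFileInputFormat, the identity mapper's input types become Text and IntWritable, so map() could forward the key and value unchanged, with no tokenizing or parsing at all.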

I hope this helps.

--Chris Nauroth

From: xeon Mailinglist <xeonmailingl...@gmail.com>
Date: Sunday, August 21, 2016 at 3:56 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Improve IdentityMapper code for wordcount

Hi,
I have created a map method that reads the map output of the wordcount example 
[1]. This example does not use the IdentityMapper.class that MapReduce offers, 
but it is the only way I have found to make a working identity mapper for the 
wordcount. The only problem is that this mapper takes much more time than I 
expected. I am starting to think that maybe I am doing some redundant work. 
Any help to improve my IdentityMapper code?

[1] Identity mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        // Each input line is "word<TAB>count", as written by the wordcount job.
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}


[2] Map class that generated the map output

public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        // Emit (word, 1) for every token on the line.
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

Thanks,
