I am trying to familiarize myself with Hadoop and to use MapReduce to
process system log files. I tried to start small with a simple
MapReduce program similar to the WordCount example provided. For each
line read in, I wanted to emit the 5th word as my output key and the
constant 1 as my output value. This seemed simple enough, but it
consistently timed out during the map phase. I then ran the WordCount
example on my data to see whether the data was the problem. It was not:
the WordCount example finished quickly with accurate results. I then
took the WordCount example and added a counter to the map function so
that it would only output the 5th word in the line. When I ran this, it
ran for 18+ hours with little to no progress. I tried a functionally
equivalent way of getting the 5th word, and it once again timed out.
Any help would be appreciated.
I am running in the Pseudo-Distributed layout described by the
Quickstart on a Windows XP machine running Cygwin. I am working on
hadoop-0.21.0. I have verified that I can run the examples provided and
that my nodes and trackers are running properly.
I took the WordCount example code described here:
http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/WordCount.java?r=1027
and changed the Map function to:
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        int count = 0;
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          if (count == 5) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
          count++;
        }
      }
    }
After 18 hrs 35 min, the map phase was 0.55% complete. There were no
errors in the logs or on the command line. Running this program without
the count variable finishes mapping in less than a minute on the same
data. When I changed it to call itr.nextToken() 4 times before calling
it a 5th time to set the word, it timed out. I previously verified that
the data always has more than 5 tokens per line. My similar program,
which regularly timed out, used the split function on my delimiter to
pull out the 5th word.
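
For comparison, here is a minimal standalone sketch (outside Hadoop) of
pulling the n-th token from a line with StringTokenizer. The class name
FifthToken, the helper nthToken, and the sample log line are hypothetical
and assume whitespace-delimited input. It consumes one token on every
pass through the loop; in the map function above, itr.nextToken() is only
reached when count == 5, so hasMoreTokens() may never become false. Note
also that a counter starting at 0 and tested against 5 selects the 6th
token, not the 5th.

```java
import java.util.StringTokenizer;

public class FifthToken {

    // Returns the n-th (1-based) whitespace-delimited token of the line,
    // or null if the line has fewer than n tokens.
    static String nthToken(String line, int n) {
        StringTokenizer itr = new StringTokenizer(line);
        int count = 0;
        while (itr.hasMoreTokens()) {
            // Consume a token on every iteration so the loop always advances.
            String token = itr.nextToken();
            count++;
            if (count == n) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the real log data.
        String line = "Jan 12 10:15:01 host sshd session opened";
        System.out.println(nthToken(line, 5)); // prints "sshd"
    }
}
```

Inside a mapper, the same loop shape would set the Text key from the
returned token and call output.collect once per line, skipping lines for
which the helper returns null.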
Thank you for your help!
- Maryanne DellaSalla