Simple change to WordCount either times out or runs 18+ hrs with little progress

2011-05-24 Thread Maryanne.DellaSalla
I am attempting to familiarize myself with hadoop and utilizing
MapReduce in order to process system log files.  I had tried to start
small with a simple map reduce program similar to the word count example
provided.  For each line read in, I wanted to emit the 5th word as my
output key and the constant 1 as my output value.  This seemed simple
enough, but the job would consistently time out during the map phase.  I
then attempted to run the WordCount example on my data to see if that
was the problem.  It was not, as the WordCount example quickly finished
with accurate results.  I then took the WordCount example, and added a
counter to the map so that it would only output the 5th word in the
line.  When I ran this, it ran for 18+ hrs with little to no progress.
I tried a programmatically identical way of getting the 5th word, and it
once again timed out.  Any help would be appreciated.

I am running in the Pseudo-Distributed layout described by the
Quickstart on a Windows XP machine running Cygwin.  I am working on
hadoop-0.21.0.  I have verified that I can run the examples provided and
that my nodes and trackers are running properly.

I took the WordCount example code described here: 

http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/WordCount.java?r=1027

and changed the Map function to:
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      int count = 0;
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        if (count == 5)
        {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
        count++;
      }
    }
  }

After 18 hrs 35 min, this had map 0.55% complete.  There were no issues
in the logs or the command line.  Running this program without the count
variable maps in less than a minute on the same data.  When I changed it
to call itr.nextToken() 4 times before calling it a 5th to set the word,
it timed out.  I previously verified that the data always had more than
5 tokens per line.  My similar program which timed out regularly used
the split function on my delimiter to pull out the 5th word.  

Thank you for your help!
-   Maryanne DellaSalla


RE: Simple change to WordCount either times out or runs 18+ hrs with little progress

2011-05-24 Thread Maryanne.DellaSalla
Ahh, well that's embarrassing and explains the situation where it runs
for many hours. 

I am still baffled as to the split on delimiter version timing out,
though. 

  String line = value.toString();
  String[] splitLine = line.split(",");

  if( splitLine.length >= 5 )
  {
    word.set(splitLine[4]);
    output.collect(word, one);
  }

This runs and times out on map every time.

Thanks.

Maryanne DellaSalla 
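
One way to rule out the parsing logic itself is to run the split-based extraction outside Hadoop. The sketch below is plain Java with no Hadoop classes; the class name SplitFieldDemo, the method fifthField, and the sample lines are hypothetical and only for illustration. It assumes a comma delimiter, as in the snippet above, and it does not by itself explain the map timeout.

```java
class SplitFieldDemo {
    // Returns the 5th comma-separated field (index 4),
    // or null when the line has fewer than 5 fields.
    static String fifthField(String line) {
        String[] splitLine = line.split(",");
        return splitLine.length >= 5 ? splitLine[4] : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the real log data.
        System.out.println(fifthField("a,b,c,d,e,f"));
        System.out.println(fifthField("a,b,c"));
    }
}
```

If this behaves as expected on a sample of the real log lines, the parsing is unlikely to be the cause, and the job configuration or task logs would be the next place to look.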

-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Tuesday, May 24, 2011 12:25 PM
To: common-user@hadoop.apache.org
Subject: Re: Simple change to WordCount either times out or runs 18+ hrs
with little progress

itr.nextToken() is inside the if.

On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:

  while (itr.hasMoreTokens()) {
    if(count == 5)
    {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
    count++;
  }
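
As the reply points out, itr.nextToken() is only called when count == 5, so on every other pass the tokenizer does not advance and hasMoreTokens() stays true. A minimal sketch of a loop that consumes a token on every pass follows. It is plain Java without the Hadoop classes, and the class name FifthTokenDemo, the method fifthToken, and the sample line are hypothetical names for illustration. It counts from 1, since with count starting at 0 the test count == 5 would select the sixth token.

```java
import java.util.StringTokenizer;

class FifthTokenDemo {
    // Returns the 5th whitespace-separated token of the line,
    // or null when the line has fewer than 5 tokens.
    static String fifthToken(String line) {
        StringTokenizer itr = new StringTokenizer(line);
        int count = 0;
        while (itr.hasMoreTokens()) {
            // Always consume a token so the loop makes progress.
            String token = itr.nextToken();
            count++;
            if (count == 5) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the real log data.
        System.out.println(fifthToken("May 24 12:25:01 host sshd started")); // prints "sshd"
    }
}
```

In the mapper, the returned token would be passed to word.set(...) before output.collect(word, one), skipping lines where the result is null.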