RE: Independent Map Reduce to parse Nutch content (Cont.)

Markus Jelsma Mon, 06 Jan 2014 08:19:21 -0800

 Hi - Check the logs first.
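
A minimal sketch of one way to act on that advice, assuming the logs are plain UTF-8 text files under a single directory. The class name `LogScan`, the method `findFailures`, and the default directory `logs` are hypothetical and are not part of Hadoop or Nutch; where the logs actually live depends on your setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop or Nutch.
class LogScan {

    // Returns "file:lineNumber: text" for every line that looks like part of a Java failure.
    static List<String> findFailures(Path logDir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(logDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path file : files) {
            List<String> lines = Files.readAllLines(file); // assumes UTF-8 text logs
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                if (line.contains("Exception") || line.contains("Caused by:")) {
                    hits.add(file + ":" + (i + 1) + ": " + line.trim());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The log directory is an assumption; pass the one your setup writes to.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "logs");
        for (String hit : findFailures(logDir)) {
            System.out.println(hit);
        }
    }
}
```

The generic "Job failed!" message only says that something went wrong; lines matched this way are usually closer to the underlying cause.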

-----Original message-----
From: Bin Wang<binwang...@gmail.com>
Sent: Saturday 4th January 2014 21:47
To: dev@nutch.apache.org
Subject: Re: Independent Map Reduce to parse Nutch content (Cont.)


Hi Tejas,

I started an AWS instance and ran Hadoop in single-node mode.

When I run:

hadoop -jar example.jar hdfsinput/ hdfsoutput/

Everything works perfectly, as I expected: a bunch of stuff gets printed to the 
screen, and both the mappers and the reducers finish without problems. In the end, 
the expected output sits in the HDFS output directory.

However, when I tried to run the jar file without Hadoop:

java -jar example.jar localinput/ localoutput/

It finishes all the mappers without a problem, but then errors out after 
all the mappers are done:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
        at arrow.ParseMapred.run(ParseMapred.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at arrow.ParseMapred.main(ParseMapred.java:18)

I am so confused now about why my code doesn't work locally.

Based on my understanding, Nutch constantly uses the Hadoop API without 
Hadoop pre-installed, so why can't my code work?

Well, any hint or directional guidance will be appreciated, many thanks!

/usr/bin

On Sat, Jan 4, 2014 at 12:38 AM, Tejas Patil <tejas.patil...@gmail.com> wrote:

Hi Bin Wang,

I would suggest that you NOT use Eclipse and instead run your code from the 
command line. Use logger statements and check the logs for the full stack traces 
of the failure. In my personal experience, logs are a better way to debug Hadoop 
code than the Eclipse debugger.
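
A minimal sketch of what such logger statements could look like, using only the JDK's `java.util.logging` so it runs on its own; the logging library your project actually uses may differ. The class `LoggedStep` and the method `contentLength` are hypothetical stand-ins for the body of a map step.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for one step of a mapper, not Hadoop or Nutch code.
class LoggedStep {
    private static final Logger LOG = Logger.getLogger(LoggedStep.class.getName());

    // Returns the content length, or -1 if the record could not be processed.
    static int contentLength(String url, byte[] content) {
        try {
            LOG.info("Processing " + url);
            return content.length; // throws NullPointerException when content is null
        } catch (RuntimeException e) {
            // Passing the exception as the last argument logs the full stack trace
            // instead of silently swallowing the failure.
            LOG.log(Level.SEVERE, "Failed on " + url, e);
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(contentLength("http://url1", new byte[5]));
        System.out.println(contentLength("http://url2", null));
    }
}
```

The point of the design is that the catch block records what failed and why, so the log shows the real exception rather than only a generic job failure.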

Thanks,

Tejas

On Fri, Jan 3, 2014 at 8:56 PM, Bin Wang <binwang...@gmail.com> wrote:

Hi,

I tried to modify the code here to parse the Nutch content data:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup

At the end of this email is a prototype that I have written to run a MapReduce 
job that calculates the HTML content length of each URL that I have scraped.

The mapper part runs perfectly fine, as expected. However, the whole program 
stops after all the mappers have finished, and the reducer does not get a chance 
to run. (I know how many pages were scraped, and the Eclipse console shows the 
same number of "Starting Mapper" messages, so I assume all the mappers finished.)

Could anyone who is experienced in writing Java MapReduce jobs take a look at my 
code and see what the error might be? I am not a Java developer at all, so any 
debugging trick or common-sense advice will be appreciated!

(I heard that it is fairly hard to debug code written using the Hadoop API. Is 
that true?)

Many thanks!

/usr/bin

_____________________________________________________

Eclipse Console Info

Starting Mapper ...
Key: http://url1
Result: 134943
Starting Mapper ...
Key: http://url2
Result: 258588
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
        at arrow.ParseMapred.run(ParseMapred.java:68)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at arrow.ParseMapred.main(ParseMapred.java:18)

_____________________________________________________

// my code

package example;

import ...;

public class ParseMapred extends Configured implements Tool,
                Mapper<WritableComparable<?>, Content, Text, IntWritable>,
                Reducer<Text, IntWritable, Text, IntWritable> {

        public static void main(String[] args) throws Exception {
                int res = ToolRunner.run(NutchConfiguration.create(),
                                new ParseMapred(), args);
                System.exit(res);
        }

        public void configure(JobConf job) {
                setConf(job);
        }

        public void close() throws IOException {}

        public void reduce(Text key, Iterator<IntWritable> values,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                        throws IOException {
                System.out.println("Starting Reducer ...");
                System.out.println("Reducer: " + "key" + key);
                output.collect(key, values.next()); // collect first value
        }

        public void map(WritableComparable<?> key, Content content,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                        throws IOException {
                Text url = new Text();
                IntWritable result = new IntWritable();
                url.set("fail");
                result = new IntWritable(1);
                try {
                        System.out.println("Starting Mapper ...");
                        url.set(key.toString());
                        result = new IntWritable(content.getContent().length);
                        System.out.println("Key: " + url);
                        System.out.println("Result: " + result);
                        output.collect(url, result);
                } catch (Exception e) {
                        // TODO Auto-generated catch block
                        output.collect(url, result);
                }
        }

        public int run(String[] args) throws Exception {
                JobConf job = new NutchJob(getConf());
                job.setJobName("ParseData");
                FileInputFormat.addInputPath(job, new Path("/Users/.../data/"));
                FileOutputFormat.setOutputPath(job, new Path("/Users/.../result"));
                job.setInputFormat(SequenceFileInputFormat.class);
                job.setOutputFormat(TextOutputFormat.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(ParseMapred.class);
                job.setReducerClass(ParseMapred.class);
                JobClient.runJob(job);
                return 0;
        }
}
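
To check the intended map and reduce logic separately from the Hadoop runtime, the same computation can be sketched with plain Java collections. This is only an illustration under stated assumptions: it does not use the Hadoop or Nutch API, the class `LocalLengthJob` and its `run` method are hypothetical, and a sorted in-memory map stands in for the shuffle step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the job above; no Hadoop classes involved.
class LocalLengthJob {

    // Input: (url, raw content) records. Output: url -> content length.
    static Map<String, Integer> run(List<Map.Entry<String, byte[]>> pages) {
        // "Map" phase: emit (url, content length) and group the values by key,
        // which is roughly what the shuffle does between map and reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, byte[]> page : pages) {
            grouped.computeIfAbsent(page.getKey(), k -> new ArrayList<>())
                   .add(page.getValue().length);
        }
        // "Reduce" phase: keep only the first value per key, like the reducer above.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), group.getValue().get(0));
        }
        return output;
    }
}
```

If this logic gives the expected lengths on a few hand-made records, the remaining problem is more likely in the job setup or runtime environment than in the map and reduce steps themselves.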
