Greetings,

Thank you for the reply.
My configuration is as follows:

Ubuntu 7.04 amd64

java -version
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_12-b04, mixed mode)

Nutch 0.9 - I made two minor changes to BasicIndexingFilter.java:
// changed the url field to UN_TOKENIZED
doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
// changed the content field to be stored as well as tokenized
doc.add(new Field("content", parse.getText(), Field.Store.YES, Field.Index.TOKENIZED));

I recompiled Nutch, and everything looked fine from there.

The rest of the libraries that I am using are the default ones that come with Nutch 0.9:
commons-cli-2.0-SNAPSHOT.jar
commons-codec-1.3.jar
commons-httpclient-3.0.1.jar
commons-lang-2.1.jar
commons-logging-1.0.4.jar
commons-logging-api-1.0.4.jar
hadoop-0.12.2-core.jar
jakarta-oro-2.0.7.jar
jets3t-0.5.0.jar
jetty-5.1.4.jar
jetty-5.1.4.LICENSE.txt
jetty-ext
junit-3.8.1.jar
junit-3.8.1.LICENSE.txt
log4j-1.2.13.jar
lucene-core-2.1.0.jar
lucene-misc-2.1.0.jar
native
pmd-ext
servlet-api.jar
taglibs-i18n.jar
taglibs-i18n.tld
xerces-2_6_2-apis.jar
xerces-2_6_2.jar

The rest of the conf files are essentially unchanged from what ships with Nutch 0.9.

I have not enabled Hadoop; I am just using a local store.
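
Concretely, nothing is overridden in conf/hadoop-site.xml, so (if I am reading the bundled hadoop-default.xml correctly) the Hadoop 0.12 defaults apply and everything runs in-process against the local filesystem:

<!-- effective defaults; conf/hadoop-site.xml is empty -->
<property>
  <name>fs.default.name</name>
  <value>local</value>    <!-- local filesystem, no DFS -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>    <!-- in-process MapReduce, no JobTracker daemon -->
</property>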

Does anyone have any ideas, or an area I should be looking into?

Thanks,

Micah




On Jul 30, 2007, at 11:30 AM, DES wrote:

Hi Micah,

What is your configuration? Do you have multiple nodes, or is it a
single machine? What version of the Hadoop library are you using?

des

On 7/30/07, Micah Vivion <[EMAIL PROTECTED]> wrote:
Greetings,

So this one has me stumped a little bit. I am running a fairly simple
Nutch crawl on our local intranet site or on our partners' intranet
sites. Every now and then, when running 'bin/nutch crawl urlfile -dir
webindex/ -depth 5', I get this exception:
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: /home/mvivion/webindex/target.com/indexes
Exception in thread "main" java.io.IOException: Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Has anyone seen this before? Any solutions for this crash?

Thanks!!!


