Hi,

I am trying to use the latest nutch-trunk version but I am running into
an unexpected "Job failed!" exception. It seems that all the crawling
work has already been done, but some threads are hung, which leads to
the exception after a timeout.

I am not sure whether this is a real Nutch issue or just my
misunderstanding of the proper configuration.

The following are the details:
I am trying to run the nutch-trunk version on a single machine (Linux).
I checked out the latest svn and built a fresh installation package
using "ant tar". Then I modified nutch-site.xml only (see attachment) -
I believe I didn't change anything special. I also experimented with
[fetcher.threads.fetch] and [fetcher.threads.per.host], but that didn't
seem to help.

Typically, the nutch crawl process seemed to work fine and it crawled
all documents on my local Apache server (both Nutch and Apache run on
the same machine), but then it didn't stop; it kept waiting for
something to finish. From that point on it just kept writing lines like
[060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s, ] into the
log, where the last two numbers (pages/s, kb/s) were decreasing as time
went by (which is logical, since they are averages over the elapsed
time and no new pages were being fetched).

Then I receive the following exception. Sometimes the log even contains
a message saying "Aborting with "+activeThreads+" hung threads.", where
activeThreads is some number (it differs depending on the configuration
setup).

... (see the attached crawl.log file for the whole log)
060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s,
060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s,
java.lang.NullPointerException
        at java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:980)
        at java.lang.Float.parseFloat(Float.java:222)
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:84)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:80)
        at org.apache.nutch.mapred.ReduceTask$2.collect(ReduceTask.java:247)
        at org.apache.nutch.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
060103 231603  map 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:344)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)
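
If I read the trace correctly, ParseOutputFormat$1.write() (line 84)
ends up calling Float.parseFloat() with a null String, which is exactly
what produces a NullPointerException out of
FloatingDecimal.readJavaFormatString(). A tiny standalone example of
that behaviour (the variable name is only for illustration; I have not
checked the Nutch source to see which value is actually null):

    // minimal reproduction: parsing a null String as a float
    public class ParseFloatNpe {
        public static void main(String[] args) {
            String score = null;               // e.g. a missing score/metadata value
            float f = Float.parseFloat(score); // throws java.lang.NullPointerException
                                               //   from FloatingDecimal.readJavaFormatString(...)
            System.out.println(f);
        }
    }

So my guess is that some entry reaches ParseOutputFormat with a missing
(null) float value, but that is only an assumption on my part.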

Does anybody know what is wrong?

Regards,
Lukas
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
  <name>http.content.limit</name>
  <value>65536000</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>1000</value>
  <description>The maximum number of characters permitted in an anchor.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

<property>
  <name>indexer.max.title.length</name>
  <value>1000</value>
  <description>The maximum number of characters of a title that are indexed.
  </description>
</property>

<property>
  <name>searcher.dir</name>
  <value>crawl</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-(http|httpclient|file)|urlfilter-regex|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/lukas/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a space- or comma-separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/lukas/nutch/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/lukas/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

</nutch-conf>