Hi, I am trying to use the latest nutch-trunk version, but I am hitting an unexpected "Job failed!" exception. It seems that all the crawling work has already been done, but some threads are hung, which results in an exception after some timeout.
I am not sure whether this is a real Nutch issue or just my misunderstanding of the proper configuration. Here are the details: I am running the nutch-trunk version on a single machine (Linux). I checked out the latest svn and built a fresh installation package using "ant tar". Then I modified nutch-site.xml only (see attachment); I don't believe I changed anything special. I also tried adjusting [fetcher.threads.fetch] and [fetcher.threads.per.host], but that didn't seem to help.

Typically the crawl seemed to work fine and crawled all documents on my local Apache server (both Nutch and Apache run on the same machine), but then it did not stop; it kept waiting for something to finish. From that point on it just produced lines like

  060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s,

in the log, where the latter two numbers (pages/s, kb/s) kept decreasing as time went by (which is logical). Then I got the exception below. Sometimes the log even contains a message saying "Aborting with "+activeThreads+" hung threads.", where activeThreads is some number (it differs depending on the conf setup).

... (see the crawl.log attachment file for the whole log)

060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s,
060103 231602 16 pages, 0 errors, 0.4 pages/s, 305 kb/s,
java.lang.NullPointerException
        at java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:980)
        at java.lang.Float.parseFloat(Float.java:222)
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:84)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:80)
        at org.apache.nutch.mapred.ReduceTask$2.collect(ReduceTask.java:247)
        at org.apache.nutch.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
060103 231603 map 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:344)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)

Does anybody know what is wrong?

Regards,
Lukas
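PS: In case it helps with diagnosis: according to the stack trace, the NPE is Float.parseFloat being handed a null string inside ParseOutputFormat$1.write (line 84), i.e. some per-page score/metadata value is apparently missing for one record. I have not checked the actual Nutch code there, but a null-safe parse along the following lines (the default value and class name below are my own illustrative assumptions, not Nutch's) would at least avoid the crash:

// Illustrative null-safe float parse; NOT the actual Nutch code.
public final class ScoreParseSketch {

    // Fallback used when the metadata string is missing or malformed.
    // The default of 1.0f is an assumption for illustration only.
    private static final float DEFAULT_SCORE = 1.0f;

    static float parseScore(String raw) {
        if (raw == null) {
            // Float.parseFloat(null) throws NullPointerException,
            // which matches the trace above; guard against it.
            return DEFAULT_SCORE;
        }
        try {
            return Float.parseFloat(raw);
        } catch (NumberFormatException e) {
            return DEFAULT_SCORE;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScore(null));  // prints 1.0 instead of crashing
        System.out.println(parseScore("0.5")); // prints 0.5
    }
}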
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
  <name>http.content.limit</name>
  <value>65536000</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>1000</value>
  <description>The maximum number of characters permitted in an anchor.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are made at
  once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
  <description>The maximum number of threads that should be allowed to
  access a host at one time.</description>
</property>

<property>
  <name>indexer.max.title.length</name>
  <value>1000</value>
  <description>The maximum number of characters of a title that are
  indexed.
  </description>
</property>

<property>
  <name>searcher.dir</name>
  <value>crawl</value>
  <description>
  Path to the root of the crawl. This directory is searched (in order)
  for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-(http|httpclient|file)|urlfilter-regex|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded. In any
  case you need to include at least the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, the fetcher will log more verbosely.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/lukas/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files. May be a space- or comma-separated list of directories on
  different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/lukas/nutch/mapred/system</value>
  <description>The shared directory where MapReduce stores control
  files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/lukas/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

</nutch-conf>
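PPS: In case it matters, the crawl is launched via the stock crawl command (hence Crawl.main in the trace above). The exact arguments below are illustrative placeholders, not necessarily what I used:

  bin/nutch crawl urls -dir crawl -depth 3 > crawl.log 2>&1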