Figured this one out, just in case some other newbie has the same problem.

Windows places hidden files in the urls dir if one customizes the folder
view. These files must be removed or Nutch thinks they are url files and
processes them. Once the hidden files are removed, all is well.
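A quick way to flush those files before crawling (just a sketch; desktop.ini and Thumbs.db are the usual names Windows drops when you customize a folder, adjust the names to whatever `ls -a urls` actually shows on your box):

```shell
# remove Windows folder-metadata files from the urls dir so Nutch
# only sees real url list files (assumes the dir is named "urls")
find urls -maxdepth 1 -type f \( -name 'desktop.ini' -o -name 'Thumbs.db' \) -print -delete
ls -a urls   # verify only your url files remain
```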

jim s



----- Original Message ----- 
From: "jim shirreffs" <[EMAIL PROTECTED]>
To: "nutch lucene apache" <[email protected]>
Sent: Thursday, April 05, 2007 11:51 AM
Subject: Run Job Crashing


> Nutch-0.8.1
> Windows 2000/Windows XP
> Java 1.6
> cygwin1.dll nov/2004 and cygwin1.dll latest release
>
>
> Very strange, ran the crawler once
>
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> and everything worked until this error
>
>
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070404094549
> Indexer: adding segment: crawl/segments/20070404095026
> Indexer: adding segment: crawl/segments/20070404095504
> Optimizing index.
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>        at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>
>
> Tried running the crawler again
>
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> and now I consistently get this error
>
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> run java in NUTCH_JAVA_HOME D:\java\jdk1.6
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> I have one file localhost in my url dir and it looks like this
>
> http://localhost
>
> My crawl-urlfilter.txt looks like this
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto|swf|sw):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*localhost/
>
> # skip everything else
>
> My nutch-site.xml looks like this
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>RadioCity</value>
>  <description></description>
> </property>
>
> <property>
>  <name>http.agent.description</name>
>  <value>nutch web crawler</value>
>  <description></description>
> </property>
>
> <property>
>  <name>http.agent.url</name>
>  <value>www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch</value>
>  <description></description>
> </property>
>
> <property>
>  <name>http.agent.email</name>
>  <value>jpsb at flash.net</value>
>  <description></description>
> </property>
> </configuration>
>
>
> I am getting the same behavior on two separate hosts.  If anyone can
> suggest what I might be doing wrong I would greatly appreciate it.
>
> jim s
>
> PS tried to mail from a different host but did not see the message in the
> mailing list.  Hope only this message gets into the mailing list.

