This is the problem:

Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt

urls.txt is a file, not a directory.

Crawl takes a directory as its first parameter, not the file itself. Inside 
the directory, it checks for a flat file with no extension called urls.
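
A minimal sketch of that fix, assuming a bash-like shell such as the cygwin one. CONTENT is a hypothetical variable standing in for the content directory from the log above, and the example.com seed line is only a placeholder so the snippet runs on its own. The flags are the ones from the original command. The final command is echoed rather than executed so it can be checked before running it.

```shell
# Hypothetical: CONTENT stands in for the content directory shown in the log.
# If it is not set, fall back to a scratch directory so the sketch is runnable.
CONTENT="${CONTENT:-$(mktemp -d)}"

# Placeholder seed list, created only if no urls.txt is present yet.
[ -f "$CONTENT/urls.txt" ] || echo "http://example.com/" > "$CONTENT/urls.txt"

# Move the seed list into a directory, as a flat file named "urls"
# with no extension (the layout described above).
mkdir -p "$CONTENT/urls"
mv "$CONTENT/urls.txt" "$CONTENT/urls/urls"

# Pass the directory, not the file, as the first argument to crawl.
CRAWL_CMD="nutch crawl $CONTENT/urls -dir $CONTENT/sf911truth -depth 3 -topN 200"
echo "$CRAWL_CMD"
```

With CONTENT set to the real content directory, the echoed line is the command to run in place of the one that failed.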

- feran_a
----- Original Message ----- 
From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
To: "nutch user" <[EMAIL PROTECTED]>
Sent: Friday, July 27, 2007 2:56 AM
Subject: cygwin - Input path doesnt exist


I've freshly installed a nutch nightly build onto my laptop using an 
up-to-date cygwin.  Basically I just downloaded the .tar.gz, ran ant, and 
verified that $NUTCH_HOME/bin/nutch works (gives me the help screen).  I set 
up nutch-site.xml and urls.txt, then attempted to crawl.  However, I get an 
org.apache.hadoop.mapred.InvalidInputException.  The hadoop.log doesn't 
report the error, only the crawl command itself.  Has anyone seen this 
before?


$ nutch crawl /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt -dir /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth -depth 3 -topN 200
crawl started in: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth
rootUrlDir = /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)





