Hmmm, creating a seed directory with a file in it named urls doesn't seem to
work.
[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/conf
$ nutch crawl /cygdrive/c/nutch-2007-07-26_04-01-20/seed -dir
/cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth -depth 3 -topN 200
crawl started in: /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth
rootUrlDir = /cygdrive/c/nutch-2007-07-26_04-01-20/seed
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth/crawldb
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/seed
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /cygdrive/c/nutch-2007-07-26_04-01-20/seed
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Maybe, as I speculated before, it's some kind of pathing problem with cygwin in
hadoop. Maybe I'll try Susam Pal's suggestion of installing a JDK within
cygwin that thinks in terms of unix paths.
--Kai
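A possible way to test the path theory without reinstalling the JDK: a Windows-native JVM generally does not understand /cygdrive/c/... paths, so the arguments can be converted to drive-letter form before they reach nutch. On Cygwin, `cygpath -m` does that conversion. The sketch below is only an illustration; `to_mixed_path` is a hypothetical helper name, and its fallback branch handles only the plain /cygdrive/<letter>/ form. Whether this actually fixes the crawl in this thread is an assumption, not something confirmed here.

```shell
# Hypothetical helper: turn a Cygwin path into a drive-letter path (C:/...).
to_mixed_path() {
  p="$1"
  # Prefer the real Cygwin tool when it is installed.
  if command -v cygpath >/dev/null 2>&1; then
    cygpath -m "$p"
    return
  fi
  # Fallback for systems without cygpath: handles /cygdrive/<letter>/<rest> only.
  case "$p" in
    /cygdrive/[a-zA-Z]/*)
      rest="${p#/cygdrive/}"     # e.g. c/nutch/seed
      drive="${rest%%/*}"        # e.g. c
      rest="${rest#*/}"          # e.g. nutch/seed
      drive="$(printf '%s' "$drive" | tr '[:lower:]' '[:upper:]')"
      printf '%s:/%s\n' "$drive" "$rest"
      ;;
    *)
      # Anything else is passed through unchanged.
      printf '%s\n' "$p"
      ;;
  esac
}

SEED_DIR="$(to_mixed_path /cygdrive/c/nutch-2007-07-26_04-01-20/seed)"
CRAWL_DIR="$(to_mixed_path /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth)"
echo "seed dir:  $SEED_DIR"
echo "crawl dir: $CRAWL_DIR"

# The crawl would then be invoked with the converted paths (not run here):
# nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 3 -topN 200
```

If the crawl gets past the Injector with the converted paths, that would point to path translation rather than the seed directory layout.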
----- Original Message ----
From: feran <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, July 27, 2007 6:20:56 AM
Subject: Re: cygwin - Input path doesnt exist
This is the problem:
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
urls.txt is not a Directory.
Crawl takes a directory as its parameter, not a path to the file itself.
Inside the directory, it looks for a flat file named urls, with no extension.
- feran_a
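The layout described above can be sketched as follows. This is a minimal illustration, not a verified fix: the working directory is a temporary stand-in for the Nutch install directory, the URL is a placeholder, and the file name urls simply follows the description in this message.

```shell
# Stand-in for the Nutch install directory so the sketch can run anywhere.
WORK_DIR="$(mktemp -d)"

# The seed argument is a directory...
SEED_DIR="$WORK_DIR/seed"
mkdir -p "$SEED_DIR"

# ...containing a flat text file with one URL per line (placeholder URL).
printf '%s\n' 'http://www.example.com/' > "$SEED_DIR/urls"

# Confirm the layout before crawling.
ls -l "$SEED_DIR"

# The crawl is then pointed at the directory, not at the file (not run here):
# nutch crawl "$SEED_DIR" -dir "$WORK_DIR/crawl" -depth 3 -topN 200
```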
----- Original Message -----
From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
To: "nutch user" <[EMAIL PROTECTED]>
Sent: Friday, July 27, 2007 2:56 AM
Subject: cygwin - Input path doesnt exist
I've freshly installed a nutch nightly build onto my laptop using an
up-to-date cygwin. Basically I just downloaded the .tar.gz, ran ant, and
verified that $NUTCH_HOME/bin/nutch works (gives me the help screen). I set
up nutch-site.xml, urls.txt and attempted to crawl. However, I get an
exception in org.apache.hadoop.mapred.InvalidInputException. The hadoop.log
doesn't report the error, just the command line crawl command. Anyone seen
this before?
$ nutch crawl /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt -dir
/cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth -depth 3 -topN 200
crawl started in: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth
rootUrlDir = /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general