I followed the tutorial at http://lucene.apache.org/nutch/tutorial8.html#Intranet%3A+Running+the+Crawl and ran the following command in my Cygwin environment:

    $ bin/nutch generate crawl/crawldb crawl/segments

I got the following error message:

    Generator: starting
    Generator: segment: crawl/segments/20070702142541
    Generator: Selecting best-scoring urls due for fetch.
    Exception in thread "main" java.io.IOException: Input directory E:/cygwin/home/Administrator/nutch-0.8.1/crawl/crawldb/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)

Do you know how to solve this problem? Any feedback would be much appreciated.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Chris Hane [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 02, 2007 12:45 PM
To: [EMAIL PROTECTED]
Subject: Re: Adding meta data to searched documents

Enis - thanks for the pointer.

Enis Soztutar wrote:
> You can write index plugins. Please first read the (slightly outdated)
> tutorial and then check http://wiki.apache.org/nutch/PluginCentral.
> Optionally you may want to write html parse plugins depending on the
> source of the data.
>
> Chris Hane wrote:
>> I am looking to use nutch to crawl/index a website. A lot of the
>> pages have videos on them. We have transcripts for the videos that we
>> would like to be included for indexing; but we do not want to put the
>> transcripts on the web pages.
>>
>> Is there a way to "add" this information to a given web page for
>> purposes of indexing as part of the crawl process? Maybe another
>> point in the process before the index is generated? I am hoping there
>> is a point in the crawl process where I can add augmented content to a
>> page in the nutch segment (rough thought based on very limited time
>> spent looking at nutch).
>>
>> We are comfortable using java and can write custom code as needed. I
>> would appreciate any pointers on where to look in the nutch code.
>>
>> Thanks in advance,
>> Chris.....
>>
>
