Problem with Crawler and Parent Directories

Wolf Fischer Thu, 02 Apr 2009 08:23:47 -0700

Hi there,

I am currently trying to use Nutch to crawl a local file directory. I have the URL of the directory, which looks like the following:

file:///C:/test/

In crawl-urlfilter.txt I added +.* for testing purposes; however, this resulted in the famous "bug" of also looking through the parent directories. So I looked into the FAQ as well as the mailing list archive and found the solution: I should simply add something like

+^file:///c:/top/directory/^
-.
to the urlfilter.txt. So I did:
+^file:///c:/test/
-.

However, if I do this, the fetcher does not get any URL at all and immediately exits because of "no more URLs to fetch." I have no idea why this is not working. I tried several other solutions and simply can't get it to work the way I want it to. Can somebody please give me a hint on what I am doing wrong?
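
For reference, here is a minimal sketch of how I understand such a rule list to be evaluated: rules are tried top to bottom, the first pattern found in the URL decides, "+" accepts and "-" rejects. This is a simplified Python model and not Nutch's actual code; the function name accepts and the rules list are made up for illustration, and I am assuming the patterns are matched case-sensitively, as regular expressions are by default. Under those assumptions, the seed URL with an uppercase C would fall through to the "-." rule:

```python
import re

def accepts(rules, url):
    """Hypothetical model of a first-match-wins URL filter.

    rules is a list of (sign, pattern) pairs; the first pattern that is
    found anywhere in the URL decides the outcome.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):  # case-sensitive by default
            return sign == "+"
    return False  # assumption: a URL matching no rule is rejected

# The two rules from my urlfilter file, in order.
rules = [
    ("+", r"^file:///c:/test/"),
    ("-", r"."),
]

seed_upper = "file:///C:/test/"  # the seed URL as I wrote it
seed_lower = "file:///c:/test/"  # same URL with a lowercase drive letter

print(accepts(rules, seed_upper))
print(accepts(rules, seed_lower))
```

If the real filter behaves like this model, it may be worth checking whether the drive letter in the seed URL and in the filter rule use the same case.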


Thanks in advance!

Wolf

--
Dipl.-Inf. Wolf Fischer

Programming Distributed Systems Lab
Institute of Computer Science
University of Augsburg
Universitätsstr. 14
86135 Augsburg, Germany

Tel:    +49 821 598-3102
Fax:    +49 821 598-2175
