Re: Problem with Crawler and Parent Directories

Alejandro Gonzalez Thu, 02 Apr 2009 08:36:20 -0700

Are you commenting out or adapting this line in crawl-urlfilter.txt?

-^(file|ftp|mailto):
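
A minimal sketch of why this line matters, assuming the behaviour described in the comments of the default crawl-urlfilter.txt: rules are checked top to bottom, the first matching pattern decides, and a URL that matches no rule is ignored. This is a Python simulation, not Nutch code; the function name first_match_filter and the three rule lists are made up for illustration, and Python's re.search stands in for the Java regex matching. The third rule list also assumes that the filter sees the drive letter exactly as written in the seed URL (upper-case "C:"), which may or may not apply to your setup.

```python
import re


def first_match_filter(rules, url):
    """Return True if url is accepted, False if it is rejected.

    rules is a list of strings such as "+^file:///c:/test/" or "-.".
    The first rule whose regex is found in the url decides the outcome.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is ignored


seed = "file:///C:/test/"

# Default exclusion still active and listed first: it matches every file: URL.
default_first = ["-^(file|ftp|mailto):", "+^file:///c:/test/", "-."]

# Exclusion adapted so that file: is no longer listed (hypothetical edit).
adapted = ["-^(ftp|mailto):", "+^file:///c:/test/", "-."]

# Same as above, but the drive letter is written as in the seed URL.
adapted_same_case = ["-^(ftp|mailto):", "+^file:///C:/test/", "-."]

for name, rules in [
    ("default_first", default_first),
    ("adapted", adapted),
    ("adapted_same_case", adapted_same_case),
]:
    print(name, first_match_filter(rules, seed))
```

In this simulation only the last rule list accepts the seed URL: the first is stopped by the default exclusion, and the second falls through to "-." because the patterns are case-sensitive.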




On Thu, Apr 2, 2009 at 5:23 PM, Wolf Fischer <
[email protected]> wrote:

> Hi there,
>
> I am currently trying to use Nutch to crawl a local file directory. The URL of
> the directory looks like this:
> file:///C:/test/
> In crawl-urlfilter.txt I added +.* for testing purposes; however, this
> resulted in the famous "bug" of also crawling the parent directories.
> So I looked into the FAQ as well as the mailing list archive and found the
> solution: I should simply add something like
> +^file:///c:/top/directory/^
> -.
> to the urlfilter.txt. So I did:
> +^file:///c:/test/
> -.
> However, if I do this, the fetcher does not get any URLs at all and
> exits immediately with "no more URLs to fetch."
> I have no idea why this is not working. I tried several other solutions and
> simply can't get it to work the way I want. Can somebody please
> give me a hint on what I am doing wrong?
>
> Thanks in advance!
>
> Wolf
>
> --
> Dipl.-Inf. Wolf Fischer
>
> Programming Distributed Systems Lab
> Institute of Computer Science
> University of Augsburg
> Universitätsstr. 14
> 86135 Augsburg, Germany
>
> Tel:    +49 821 598-3102
> Fax:    +49 821 598-2175
>
>
