Re: [Nutch-general] Need Help with crawl-urlfilter.txt

Ravi Chintakunta Thu, 22 Mar 2007 18:52:24 -0800

Hi Sriram,

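In regex, . matches any single character, and following . with a *
matches any character zero or more times. That is, .* in combination
is a wildcard match. On its own, * only repeats the character before
it, so Special:* means "Special" followed by zero or more colons.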
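
A small sketch of the difference, using plain java.util.regex (the
class and method names here are made up for illustration and are not
part of Nutch):

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "Special:*" is "Special" followed by zero or more ':' characters.
        System.out.println(fullMatch("Special:*", "Special:::"));         // true
        System.out.println(fullMatch("Special:*", "Special:Watchlist"));  // false
        // "Special:.*" is "Special:" followed by zero or more of any character.
        System.out.println(fullMatch("Special:.*", "Special:Watchlist")); // true
    }
}
```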


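So modifying your regex to:

-^https://wiki.mydomain.com/index.php/Special:.*

should fix the problem. Note the https: the URLs in your log all start
with https://, so a rule anchored at ^http:// cannot match them.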
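
To check a rule list before running a crawl, something like the
following rough sketch may help. It is not Nutch code: it assumes the
filter tries rules top to bottom, that the first matching rule decides,
and that patterns are searched for rather than matched against the
whole URL. The class name UrlFilterSketch is hypothetical. It also
shows that the URLs in the log use https, so rules starting with
^http:// never apply to them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Returns true if the URL would be accepted, under the assumptions
    // stated above (first matching rule wins, unanchored search).
    // A URL that matches no rule is rejected.
    static boolean accept(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "https://wiki.mydomain.com/index.php/Special:Watchlist";
        List<String> original = Arrays.asList(
            "-^http://wiki.mydomain.com/index.php/Special:",
            "+^https://wiki.mydomain.com/index.php",
            "-.");
        List<String> fixed = Arrays.asList(
            "-^https://wiki.mydomain.com/index.php/Special:.*",
            "+^https://wiki.mydomain.com/index.php",
            "-.");
        System.out.println(accept(original, url)); // true: still fetched
        System.out.println(accept(fixed, url));    // false: skipped
    }
}
```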

- Ravi Chintakunta


On 3/22/07, SriramG <[EMAIL PROTECTED]> wrote:
>
> I am trying to crawl a wikipedia site.
>
> I want to skip any URL which has the term Special:
>
> Eg:
> https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> https://wiki.mydomain.com/index.php/Special:Watchlist
> https://wiki.mydomain.com/index.php/Special:Contributions/SName
> https://wiki.mydomain.com/index.php/Special:Recentchanges
>
> This is my crawl-urlfilter.txt
> -^http://wiki.mydomain.com/index.php/Special:
> -^http://wiki.mydomain.com/index.php/Special:*
> -^http://wiki.mydomain.com/index.php/Special:*/
> -^http://wiki.mydomain.com/index.php/Special:*/*
> -^https://wiki.mydomain.com/index.php/Special:Upload
> +^https://wiki.mydomain.com/index.php
> -.
>
> But I still see these URLs in the fetcher logs:
>
> 2007-03-22 12:52:15,387 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
> 2007-03-22 12:52:32,128 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
> 2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
> 2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
> 2007-03-22 12:52:32,179 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
> 2007-03-22 12:52:32,198 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
> 2007-03-22 12:52:32,322 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
> 2007-03-22 12:52:32,323 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> 2007-03-22 12:52:32,326 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
> 2007-03-22 12:52:32,339 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> 2007-03-22 12:52:32,343 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering
>
>
> I am not sure what's wrong with my regular expression.
>
> Any help would be appreciated.
>
>
> --
> View this message in context: 
> http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

