Ah, thanks EM. So basically we need to escape the dots, something that didn't even occur to me. Many thanks!

Dean

----- Original Message -----
From: "EM" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Sunday, November 06, 2005 9:43 PM
Subject: Re: Not crawling specific pages




Here's the relevant extract from my crawl-urlfilter.txt file:

# Site to crawl
+^http://([a-z0-9]*\.)*mysite.org/

# ignore error pages
-^http://www.mysite.org/view/.error_page

As you can see, I took a "guess" that I could simply use the minus sign as a means of ignoring the page that I want excluded.

This doesn't seem to work. Any guidance would be greatly appreciated.


Any dot in the URL has to be written as "\." (without the quotes).
An unescaped dot in the expression matches any character, not just a literal dot.
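
To make the difference concrete, here is a small sketch using Python's re module rather than Nutch itself (Nutch uses Java regular expressions, but the meaning of an unescaped dot is the same in both). The test URLs and the helper name matches are made up for illustration, and it assumes the dot before "error_page" is a literal dot in the real URL.

```python
import re

# Exclusion pattern as posted (without the leading "-"): dots are unescaped.
UNESCAPED = r"^http://www.mysite.org/view/.error_page"
# Same pattern with every literal dot escaped.
ESCAPED = r"^http://www\.mysite\.org/view/\.error_page"


def matches(pattern, url):
    """Hypothetical helper: True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# Illustrative URLs only; the second and third put an "X" where a dot belongs.
urls = [
    "http://www.mysite.org/view/.error_page",
    "http://wwwXmysite.org/view/.error_page",
    "http://www.mysite.org/view/Xerror_page",
]

for url in urls:
    print(url, matches(UNESCAPED, url), matches(ESCAPED, url))
```

Both patterns match the first URL. Only the unescaped pattern matches the other two, because each unescaped dot accepts any single character. In other words, escaping makes the rule stricter; the unescaped form still matches the intended page but can also catch URLs it was never meant to.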

For more, google for "regex"

Hope this helps,
EM




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
