Ah, thanks EM. So basically we need to escape the dots, something
that didn't even occur to me. Many thanks!
Dean
----- Original Message -----
From: "EM" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Sunday, November 06, 2005 9:43 PM
Subject: Re: Not crawling specific pages
Here's the relevant extract from my crawl-urlfilter.txt file:-
# Site to crawl
+^http://([a-z0-9]*\.)*mysite.org/
# ignore error pages
-^http://www.mysite.org/view/.error_page
As you can see, I took a "guess" that I could simply use the minus sign
as a means of ignoring the page that I want excluded.
This doesn't seem to work. Any guidance would be greatly appreciated.
Any literal dot in the URL has to be written as "\." (without the quotes).
An unescaped dot in the expression matches any single character.
For more, google for "regex".
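To see the difference, here is a minimal sketch in Python (not Nutch itself). It assumes the error page path really does contain a literal ".error_page" and that the filter does an unanchored "find" style match, which re.search imitates here. The lookalike URL and the helper name are made up for illustration.

```python
import re

# The exclusion pattern as originally written: every "." is a wildcard.
unescaped = re.compile(r"^http://www.mysite.org/view/.error_page")

# The same pattern with each literal dot escaped as "\.".
escaped = re.compile(r"^http://www\.mysite\.org/view/\.error_page")

# The URL we actually want to exclude (assumed to contain a literal dot).
intended = "http://www.mysite.org/view/.error_page"

# A hypothetical URL where other characters sit where the dots should be.
lookalike = "http://wwwXmysite-org/view/_error_page"


def matches(pattern, url):
    """Return True if the pattern is found in the URL (find-style match)."""
    return pattern.search(url) is not None


for name, pattern in (("unescaped", unescaped), ("escaped", escaped)):
    print(name, matches(pattern, intended), matches(pattern, lookalike))
```

Both patterns match the intended URL, but only the unescaped one also matches the lookalike. In other words, escaping makes the rule stricter; it does not make a rule match something it missed before. As far as I know, the filter also uses the first rule in the file that matches a URL, so it may be worth checking whether the "-" line needs to come before the "+" line.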
Hope this helps,
EM
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general