Hi guys, this is my first post to this list and I'm a new Nutch user, so please
be gentle ;-)
First of all, I'd like to congratulate everyone involved in this project. I've
been playing with Nutch all weekend and have been extremely impressed with it
so far. I'm experimenting by indexing one of my own websites (some 30k pages),
an index which I hope will ultimately become the main way our users search the
site. The current search mechanism is a simple text-based system (PostNuke)
and far from ideal.
One of the issues I've come across is "high-ranking" pages that are actually
error pages. This is largely due to the site structure (really, the error page
should return an HTTP error status such as 404 rather than 200). Regardless of
the underlying problem, I need to find a way around it.
The error page is ".error_page".
As this is for an internal site search engine, I've used the intranet example
given in the tutorial. Here's the relevant extract from my crawl-urlfilter.txt
file:
# Site to crawl
+^http://([a-z0-9]*\.)*mysite.org/
# ignore error pages
-^http://www.mysite.org/view/.error_page
As you can see, I guessed that I could simply use the minus sign as a means of
excluding the page I want ignored, but this doesn't seem to work. Any guidance
would be greatly appreciated.
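For what it's worth, my own (unconfirmed) guess is that two things might be
biting me: I believe the URL filter applies its rules in order and the first
matching pattern wins, so my "+" rule may be matching the error page before
the "-" rule is ever reached; and the unescaped dots in my patterns are regex
wildcards rather than literal dots. If that's right, something like the
following might behave better, but I'd appreciate confirmation:

# ignore error pages (listed first, since the first matching rule wins)
-^http://www\.mysite\.org/view/\.error_page
# Site to crawl
+^http://([a-z0-9]*\.)*mysite\.org/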
Thanks,
Dean
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general