Hi all, this is my first post to this list, and I'm a new Nutch user, so please be gentle ;-)

First of all, I'd like to congratulate everyone involved in this project. I've been playing with Nutch all weekend and have been extremely impressed so far. I'm experimenting by indexing one of my own websites (some 30k pages), an index which I hope will ultimately become the main way our users search the site. The current search mechanism is a simple text-based system (PostNuke) and far from ideal.

One of the issues I've come across is "high-ranking" pages which are actually error pages. This is largely due to the site structure (really, the error page should return an HTTP error status such as a 404 rather than a 200). Regardless of the underlying problem, I need to find a way around it.

The error page is ".error_page".

As this is for an internal site search engine, I've followed the intranet example given in the tutorial.

Here's the relevant extract from my crawl-urlfilter.txt file:

# Site to crawl
+^http://([a-z0-9]*\.)*mysite.org/

# ignore error pages
-^http://www.mysite.org/view/.error_page

As you can see, I guessed that I could simply prefix a pattern with a minus sign to exclude the page I don't want crawled.
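One way to sanity-check the patterns themselves, outside Nutch, is a quick script (a minimal sketch using Python's re module as a stand-in; Nutch itself evaluates these as Java regexes, but both patterns use only portable syntax):

```python
import re

# The two rules from crawl-urlfilter.txt, in file order
include = re.compile(r"^http://([a-z0-9]*\.)*mysite.org/")
exclude = re.compile(r"^http://www.mysite.org/view/.error_page")

url = "http://www.mysite.org/view/.error_page"

# Both patterns match the error-page URL
print(bool(include.search(url)))  # True
print(bool(exclude.search(url)))  # True
```

So the exclude pattern is valid as a regex, but the include rule above it matches the same URL. If Nutch applies the rules top-down and the first matching rule wins (which is my assumption), the plus rule would claim the URL before the minus rule is ever reached.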

This doesn't seem to work. Any guidance would be greatly appreciated.

Thanks,

Dean



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
