Hi guys, this is my first post to this list and I'm a new Nutch user, so please
be gentle ;-)
First of all, I'd like to congratulate everyone involved in this project. I've
been playing with Nutch all weekend and have been extremely impressed with it
so far. I'm experimenting by indexing one of my own websites (some 30k pages),
an index which I hope will ultimately become the main way our users search the
site. The current search mechanism is a simple text-based system (PostNuke)
and far from ideal.
One of the issues I've come across is "high-ranking" pages that are actually
error pages. This is largely due to the site structure (really, the error page
should return an HTTP error status such as 404 rather than 200). Regardless of
the underlying problem, I need to find a way around it.
The error page is ".error_page".
As this is for an internal site search engine, I've used the intranet example
given in the tutorial. Here's the relevant extract from my crawl-urlfilter.txt
file:
# Site to crawl
+^http://([a-z0-9]*\.)*mysite.org/
# ignore error pages
-^http://www.mysite.org/view/.error_page
As you can see, I guessed that I could simply use the minus sign as a means of
excluding the page I want ignored, but this doesn't seem to work. Any guidance
would be greatly appreciated.
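For what it's worth, my own (unconfirmed) guess is that two things might be
biting me: I believe the URL filter applies its rules in order and the first
matching pattern wins, so my "+" rule may be matching the error page before
the "-" rule is ever reached; and the unescaped dots in my patterns are regex
wildcards rather than literal dots. If that's right, something like the
following might behave better, but I'd appreciate confirmation:

# ignore error pages (listed first, since the first matching rule wins)
-^http://www\.mysite\.org/view/\.error_page
# Site to crawl
+^http://([a-z0-9]*\.)*mysite\.org/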
Thanks,
Dean
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general