Ah, OK, I get it.  Sadly for me, this precise approach is probably not going
to meet my requirements, but it really helps to get me going, and I think a
variation on it will suit me quite well.  I'm very much looking forward to
seeing the script that automates this.

I have one minor quibble with this:


> And yes you may have some duplicates in your indexes but this is taken
> care of in the search itself (there is a dedupField option in
> NutchBean).  Of the duplicates the one with the best score (most
> relevant) should be returned.


If you truly have two versions of the same page (same URL), I can imagine a
scenario where you don't necessarily want the one with the highest score.
If the content has changed, you want the one that was most recently
fetched.  You want the best chance of showing an excerpt from the current
page and scoring the current content against other pages that are also hits.
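
To make that concrete, here is a rough sketch of the rule I have in mind.
It is plain Python rather than Nutch code, and it does not use NutchBean or
the dedupField option at all; the Hit class and its field names (url, score,
fetch_time) are names I made up for illustration.  It keeps one hit per URL,
preferring the most recent fetch and using score only to break ties.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical hit record; these field names are assumptions,
    # not Nutch's actual data model.
    url: str          # key used to detect duplicates
    score: float      # relevance score
    fetch_time: int   # when the page was fetched, e.g. epoch seconds


def dedup_by_latest_fetch(hits):
    """Keep one hit per URL: the most recently fetched one.

    Score is used only as a tie-breaker between hits with the same
    fetch time. The survivors are returned ordered by score, highest first.
    """
    best = {}
    for hit in hits:
        kept = best.get(hit.url)
        if kept is None or (hit.fetch_time, hit.score) > (kept.fetch_time, kept.score):
            best[hit.url] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)
```

With two copies of the same URL, the older one scoring 0.9 and the newer one
scoring 0.5, this keeps the newer copy, where a best-score rule would keep
the older one.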

Many thanks for all your help; it clears up a lot for me.

- Charlie
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general