Re: [Nutch-dev] search.jsp host grouping patch

Doug Cutting Mon, 26 Jul 2004 13:41:01 -0700

Stefan Groschupf wrote:

One of my recent mails asked about document ids when there is more than one segment index. If you can answer that question and suggest a way to get a unique document id, we can improve the speed considerably.


Is this the question?

Stefan Groschupf wrote:
> So a question: as far as I understand Lucene, the document number (==
> hit.getIndexDocNo()) is unique per index.
> If that is true, then hit.getIndexDocNo() is not unique, since hits
> can be
> found in different segment indexes and on different servers, can't they?
> Is there any chance to get a unique id of the document that is not
> stored in the details?

There are several ways to uniquely identify a page. The combination of indexDocNo and segment name is unique. If you've run "dedup" on the segments, then both the URL and the MD5 digest are also unique.
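
As a minimal sketch of the first option, the segment name and indexDocNo can be combined into one key object. HitKey and the segment names below are hypothetical and only illustrate the idea; they are not part of the Nutch API.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical key type (not a Nutch class): identifies a hit by the segment
// it came from plus its Lucene document number within that segment's index.
final class HitKey {
    private final String segmentName;
    private final int indexDocNo;

    HitKey(String segmentName, int indexDocNo) {
        this.segmentName = Objects.requireNonNull(segmentName);
        this.indexDocNo = indexDocNo;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof HitKey)) return false;
        HitKey other = (HitKey) o;
        // Both parts must match: doc numbers repeat across segments.
        return indexDocNo == other.indexDocNo
            && segmentName.equals(other.segmentName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(segmentName, indexDocNo);
    }

    @Override
    public String toString() {
        return segmentName + "/" + indexDocNo;
    }
}

class HitKeyDemo {
    public static void main(String[] args) {
        Set<HitKey> seen = new HashSet<>();
        // Made-up segment names, for illustration only.
        seen.add(new HitKey("segment-A", 42));
        // Same doc number in a different segment: a different page.
        seen.add(new HitKey("segment-B", 42));
        // Same segment and doc number: the same page, so the set is unchanged.
        seen.add(new HitKey("segment-A", 42));
        System.out.println(seen.size()); // prints 2
    }
}
```

Because the key implements equals and hashCode over both fields, it can be used directly in a HashSet or as a HashMap key when merging hits from several segment indexes.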

What do you need a unique id for? Each page should only occur once in a hit list. Search-time duplicate detection is just done to reduce the number of hits per site, no?
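
For illustration, capping the number of hits per site at search time needs only the host of each hit, not a unique document id. This is a rough sketch under assumptions: the {host, url} array shape and the limitPerHost helper are hypothetical, not the code in search.jsp or the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HostLimiter {
    // Keeps at most maxPerHost hits for each host, preserving ranking order.
    // Each hit is a hypothetical {host, url} pair; element 0 is the host.
    static List<String[]> limitPerHost(List<String[]> rankedHits, int maxPerHost) {
        Map<String, Integer> counts = new HashMap<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] hit : rankedHits) {
            // merge returns the updated count of hits seen for this host.
            int seenForHost = counts.merge(hit[0], 1, Integer::sum);
            if (seenForHost <= maxPerHost) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```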

Doug


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
