On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote:
> hi there,
>
> i browsed through the list and had some different searches but i do not
> find, what i'm looking for.
>
> i got an index which is generated by a bot, collecting websites. there
> are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1
> these different urls have the same content and when u search for a word,
> matching, both are returned, which is correct.
>
> they have excatly the same score because of there content an so one, so
> i would like to know if its possible "to group by" (mysql, of course)
> the returned score, so that only the first match is collected into
> "Hits" and all following matches with the same score are ignored.
>
> it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to IndexSearcher.search()
Have a look at the javadocs of the org.apache.lucene.search package,
it's quite straightforward. The PriorityQueue from the
util package is useful to collect results. For every distinct score you could
store an int[] of document nrs in there while collecting the hits.
Basically you'll end up implementing your own Hits class.

For URL's that have the same content, it's better
to store multiple URL's for the same document. However, this
merging is normally done by a crawler because the same contents
means the same outgoing URL's. Crawlers also keep track
of multiple host names resolving to the same IP address.

In case you need to crawl and index an intranet or more, have a look
at Nutch.

Regards,
Paul Elschot




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to