Bugs item #989844, was opened at 2004-07-13 03:32
Message generated for change (Comment added) made by abial
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=989844&group_id=59548
Category: searcher
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Stefan Groschupf (joa23)
Assigned to: Doug Cutting (cutting)
Summary: host grouping
Initial Comment:
This patch groups hits by host on each result page.
If you like it, _please_ vote for it. ;)
The hits of one host are grouped _per page_ (! not over the
complete result set) in a HostHit object.
A HostHit object holds 1+n Hit objects: at least one, plus any
further hits from the same host.
The patch provides a new API alongside the old API.
HostHits = NutchBean.search(Query query, int numHits, int hitsPerPage)
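To illustrate the per-page grouping described above, here is a minimal sketch. It is not the code from the patch: SketchHit, SketchHostHit and PageGrouper are hypothetical stand-ins, and the sketch assumes the hits of one page arrive sorted by descending score and carry an absolute URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a search hit; not Nutch's Hit class.
class SketchHit {
    final String url;
    final float score;
    SketchHit(String url, float score) { this.url = url; this.score = score; }
}

// Hypothetical stand-in for the patch's HostHit: all hits of one host on one page.
class SketchHostHit {
    final String host;
    final List<SketchHit> hits = new ArrayList<>();
    SketchHostHit(String host) { this.host = host; }
}

class PageGrouper {
    // Groups the hits of a single result page by host.
    // Groups appear in the order in which their first hit appears,
    // so the first hit of each group is that host's best hit on the page.
    static List<SketchHostHit> groupByHost(List<SketchHit> pageHits) {
        Map<String, SketchHostHit> byHost = new LinkedHashMap<>();
        for (SketchHit hit : pageHits) {
            // getHost() returns null for URLs without a host part;
            // such hits all end up in one group with a null host.
            String host = URI.create(hit.url).getHost();
            byHost.computeIfAbsent(host, SketchHostHit::new).hits.add(hit);
        }
        return new ArrayList<>(byHost.values());
    }
}
```

Because grouping only looks at one page, the same host can still show up again on a later page.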
This patch allows you to implement different scenarios, such as:
+ show only one hit from a host per page
+ show all hits from a host below the hit with the highest score
and indent them, similar to what Google does
+ show one hit per host and show the URLs of the other hits from
that host below it
+ allow users to switch host grouping on or off
+ much more
Whatever you wish to do, you need to implement it in the JSP page
with the new method call, using HostHits, HostHit and Hit.
Some code snippets to give an idea of how you can do that:
HostHits hits = bean.search(query, start+hitsPerPage, hitsPerPage);
HostHit[] show = hits.getHostHits(start, length);
...
if (hits.getTotal() <= start) {
  start = (int) (hits.getTotal()/hitsPerPage-0.49);
}
...
// The first hit of each group is the main hit for that host.
Hit mainHit = show[i].getHit(0);
HitDetails detail = bean.getDetails(mainHit);
String title = detail.getValue("title");
String url = detail.getValue("url");
String summary = bean.getSummary(detail, query);
...
int hostHitsCount = show[i].getHits().length;
if (hostHitsCount > 1) {
  // The remaining hits from the same host, after the main hit.
  for (int j = 1; j < hostHitsCount; j++) {
    HitDetails hostHitDetail = bean.getDetails(show[i].getHit(j));
    String hostHitUrl = hostHitDetail.getValue("url");
    ...
  }
}
You also need to add this to nutch-default.xml:
<property>
  <name>search.page.raw.hits.factor</name>
  <value>2</value>
  <description>
  A factor used to determine the number of raw hits initially
  fetched, before host grouping is done.
  </description>
</property>
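As a rough illustration of how such a factor could be applied, the sketch below multiplies the requested number of hits by the factor to get the number of raw hits to fetch. That the patch combines the two values by plain multiplication is an assumption; RawHitsSketch and rawHitsToFetch are hypothetical names, not part of the patch.

```java
class RawHitsSketch {
    // Assumption: raw hits = requested hits * factor.
    // The product is computed as a long and clamped so it cannot overflow int.
    static int rawHitsToFetch(int numHits, int factor) {
        long raw = (long) numHits * factor;
        return (int) Math.min(raw, Integer.MAX_VALUE);
    }
}
```

With the default value of 2, a request for 10 hits would fetch 20 raw hits before grouping, which leaves room for several hits of one host to collapse into a single HostHit.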
----------------------------------------------------------------------
>Comment By: Andrzej Bialecki (abial)
Date: 2004-07-13 12:52
Message:
Logged In: YES
user_id=32200
First of all, I think it's a good patch - I vote yes, with
some reservations explained below.
In NutchBean.search(): I'm not sure if there is any point in
calling searcher.search() twice. If the original numHitsRaw was
too large, you won't get any different results when you
decrease it, you'll just get fewer results... the value you
pass in numHitsRaw is a "best effort" target anyway, i.e.
the searcher will try to return as many top hits as this
target, or fewer if fewer are available.
It is unfortunate that you have to get the HitDetails at
this stage (it's costly, especially when distributed), but
it can't be avoided if you want to know the host name.
However, I think that then instead of discarding this
valuable HitDetails data, you should stick it into HostHit
either together with the Hit, or instead of it (HitDetails
already contains segment and doc numbers, you can store the
score in a separate ArrayList). This way you can later get
the details directly from individual
HostHit.getDetails(index)/HostHit.getScore(index), without
calling the bean again and transmitting again the same
HitDetails from segments and possibly from other servers.
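A minimal sketch of what this suggestion could look like follows. SketchDetails is a hypothetical stand-in for HitDetails, and CachingHostHit is an illustrative name; only getDetails(index) and getScore(index) are taken from the comment above, the rest is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Nutch's HitDetails; it only carries a URL here.
class SketchDetails {
    final String url;
    SketchDetails(String url) { this.url = url; }
}

// Sketch of a HostHit that keeps the details fetched during grouping,
// so the page does not have to ask the bean for them a second time.
class CachingHostHit {
    private final List<SketchDetails> details = new ArrayList<>();
    private final List<Float> scores = new ArrayList<>();

    // Stores the details and score of one hit; both lists share the same index.
    void add(SketchDetails d, float score) {
        details.add(d);
        scores.add(score);
    }

    SketchDetails getDetails(int index) { return details.get(index); }

    float getScore(int index) { return scores.get(index); }

    int getLength() { return details.size(); }
}
```

The grouping step would call add() once per hit while it already holds the details, and the JSP page would later read them back by index.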
HostHit.countHostHits() is a misnomer - I suggest changing
it to simply getSize() or getCount(), or getLength().
I noticed you like to use LinkedList - I strongly suggest
using ArrayList instead in all places in this patch. For
pure "add(Object)" and "get(index)" operations ArrayList is
always much faster (~2 times) than LinkedList. This is even
more true for the toArray() operation.
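For anyone who wants to check this on their own machine, here is a small sketch that runs the same add(Object)/get(index) pattern against either list type. ListAccessSketch and its methods are hypothetical names, and the wall-clock timing is crude, so the numbers are only a rough indication and not a proper benchmark.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ListAccessSketch {
    // Fills the list with add(Object) and reads it back with get(index).
    static long fillAndSum(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            // On a LinkedList, get(i) has to walk the list to reach position i.
            sum += list.get(i);
        }
        return sum;
    }

    // Crude wall-clock timing of one fill-and-read pass, in nanoseconds.
    static long nanosFor(List<Integer> list, int n) {
        long t0 = System.nanoTime();
        fillAndSum(list, n);
        return System.nanoTime() - t0;
    }
}
```

Comparing ListAccessSketch.nanosFor(new ArrayList<>(), 20000) with ListAccessSketch.nanosFor(new LinkedList<>(), 20000) over a few runs gives a feel for the difference; the gap grows with list size because indexed access on a LinkedList is linear in the index.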
I suggest renaming the property from
"search.page.raw.hits.factor" to
"searcher.hostgrouping.rawhits.factor", to be more
consistent with naming of other properties.
Class-level and method-level comments contain some cruft -
missing javadoc information (to be completed), or foreign
RCS revision numbers (these need to be removed, or formatted
in a way that prevents automatic expansion in our CVS).
----------------------------------------------------------------------
Comment By: holman (ericholman)
Date: 2004-07-13 05:54
Message:
Logged In: YES
user_id=1015664
I vote for it also. However, I would ultimately like to see
grouping across the entire result set, rather than just per
page.
----------------------------------------------------------------------