Bugs item #989844, was opened at 2004-07-13 03:32
Message generated for change (Comment added) made by abial
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=989844&group_id=59548

Category: searcher
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Stefan Groschupf (joa23)
Assigned to: Doug Cutting (cutting)
Summary: host grouping

Initial Comment:
This patch groups hits by host on each result page.

If you like it, _please_ vote for it. ;)

The hits of one host are grouped _per page_ (not over the 
complete result set) in a <code>HostHit</code> object.
A HostHit object holds one or more Hit objects.

The patch provides a new API alongside the old one:
HostHits NutchBean.search(Query query, int numHits, int hitsPerPage)

This patch allows you to realize different scenarios, such as:

+ show only one hit from a host per page
+ show all hits from a host below the hit with the highest score,
indented, similar to how Google does it
+ show one hit per host and list the URLs of the other host hits
below it
+ allow users to switch host grouping on or off
+ much more

Whatever you wish to do, you realize it in the JSP page
with the new method call, using HostHits, HostHit and Hit.

Some code snippets to get an idea of how you can do that:

HostHits hits = bean.search(query, start + hitsPerPage, hitsPerPage);
HostHit[] show = hits.getHostHits(start, length);
...
if (hits.getTotal() <= start) {
    start = (int) (hits.getTotal() / hitsPerPage - 0.49);
}
...
Hit mainHit = show[i].getHit(0);
HitDetails detail = bean.getDetails(mainHit);
String title = detail.getValue("title");
String url = detail.getValue("url");
String summary = bean.getSummary(detail, query);
...
int hostHitsCount = show[i].getHits().length;
if (hostHitsCount > 1) {
    for (int j = 1; j < hostHitsCount; j++) {
        HitDetails hostHitDetail = bean.getDetails(show[i].getHit(j));
        String hostHitUrl = hostHitDetail.getValue("url");
        ...
    }
}

You need to add this to nutch-default.xml as well.

<property>
  <name>search.page.raw.hits.factor</name>
  <value>2</value>
  <description>
  A factor used to determine the number of raw hits initially fetched,
  before host grouping is done.
  </description>
</property>


----------------------------------------------------------------------

>Comment By: Andrzej Bialecki (abial)
Date: 2004-07-14 12:40

Message:
Logged In: YES 
user_id=32200

The more I look at this patch the more I think that this
should be done in a different way.

Normally you would do the following to display a page of
results:

   Hits hits = bean.search(query, start + hitsPerPage);
   int end = (int) Math.min(hits.getTotal(), start + hitsPerPage);
   int length = end - start;
   Hit[] show = hits.getHits(start, length);
   HitDetails[] details = bean.getDetails(show);
   String[] summaries = bean.getSummary(details, query);

(stolen from search.jsp)

Please note that this invokes searcher.search() exactly once
per page, and retrieves only the HitDetails for the hits on
the current page.

It would be best to keep it that way as much as possible
with the grouping code, to maintain similar performance.
Your current patch tries to implement host grouping across
all pages displayed so far (which is I think a laudable
goal), but I thought that "simple" meant that initially we
try to group only per display page... ;-)

With this assumption, I suggest the following changes (since
the users will have to use different API anyway):

* change the semantics of the search method from:
     searchGroupedByPage(query, numHits, hitsPerPage)
  to
     searchGroupedByPage(query, pageNum, hitsPerPage)

The new method should run searcher.search() just once, with
the appropriately calculated total number of hits so that
you get the new results at the tail, to use them for the
current display page grouping. And then, instead of
retrieving individual hits one by one you should get just
the Hit[] range, starting from the latest start position
(equal, I think, to pageNum * rawHitsFactor), and
hitsPerPage * rawHitsFactor long. This range is needed for
the grouping of the current display page only. Use also the
same, more efficient bulk operation, to retrieve HitDetails
for these Hit[].

Now it pays off to cache the HitDetails inside each HostHit,
because you only store a maximum of (rawHitsFactor *
hitsPerPage) of HitDetails - an acceptable compromise to
avoid getting the same HitDetails twice.
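The caching idea could be sketched roughly like this (Hit and HitDetails are stand-in stubs here to keep the sketch self-contained, and all field and method names beyond those mentioned in the thread are my assumptions, not the patch's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in stubs for Nutch's Hit/HitDetails, just so the sketch compiles alone.
class Hit { }
class HitDetails { }

// Sketch: HostHit stores the HitDetails (and scores) gathered while grouping,
// so the JSP never has to call bean.getDetails() a second time.
public class HostHit {
    private final List hits = new ArrayList();     // Hit objects
    private final List details = new ArrayList();  // HitDetails, parallel to hits
    private final List scores = new ArrayList();   // Float scores, parallel to hits

    public void add(Hit hit, HitDetails detail, float score) {
        hits.add(hit);
        details.add(detail);
        scores.add(Float.valueOf(score));
    }

    public Hit getHit(int i) { return (Hit) hits.get(i); }
    public HitDetails getDetails(int i) { return (HitDetails) details.get(i); }
    public float getScore(int i) { return ((Float) scores.get(i)).floatValue(); }
    public int getSize() { return hits.size(); }

    public static void main(String[] args) {
        HostHit hh = new HostHit();
        hh.add(new Hit(), new HitDetails(), 0.5f);
        System.out.println("size=" + hh.getSize() + " score=" + hh.getScore(0));
    }
}
```

With at most (rawHitsFactor * hitsPerPage) entries per page, the memory cost of the cached lists stays bounded, which is the compromise described above.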

So, in the end, the end-user API example would look like this:

HostHits hits = bean.searchGroupedByPage(query, pageNum, hitsPerPage);
HostHit[] show = hits.getHostHits(); // start and length already set by the search method
for (int i = 0; i < show.length; i++) {
    Hit hit = show[i].getHit(0);
    HitDetails details = show[i].getDetails(0); // already cached
    String title = details.getValue("title");
    String url = details.getValue("url");
...

and so on, pretty much as in your original example.

What do you think?

Also, I hate to sound picky, but you still didn't clean up
and fill in the javadoc comments in this patch, and also
IMHO you should rename the method countHostHits() to size(),
length(), getSize(), or somesuch... it doesn't really count
anything, it just returns the size. FWIW, I disagree with
Doug that we should use URL to get the host name - the URL
pattern for host name is pretty straightforward, and we can
do it manually. However, it's not just "http://" as you
assume - with all the plugins we support now it could be
anything (e.g. "smb://" ;-) ). So, I think you should just
find the first "://" and then continue until the first "/".
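The scheme-agnostic scan described above could look like this (an illustrative sketch only; the class and method names are mine, not from the patch):

```java
// Sketch of the host extraction Andrzej describes: find the first "://",
// then take everything up to the next "/". Works for http://, smb://, etc.
public class HostExtractor {
    public static String getHost(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // no scheme found; fall back to the raw string
        }
        int hostStart = schemeEnd + 3;
        int hostEnd = url.indexOf('/', hostStart);
        if (hostEnd < 0) {
            hostEnd = url.length(); // no path part; host runs to the end
        }
        return url.substring(hostStart, hostEnd);
    }

    public static void main(String[] args) {
        System.out.println(getHost("http://sourceforge.net/tracker/?aid=989844")); // sourceforge.net
        System.out.println(getHost("smb://fileserver/share")); // fileserver
    }
}
```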

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-14 02:39

Message:
Logged In: YES 
user_id=396197

Thanks for the comments!
Please find attached a improve version of the grouping patch. 
I use ArrayList now.
I retrieve the host now by usage of URL().getHost. 

To cache the details, would require change the Hit.java object from Doug. 
It could be easy done (Let me know if you wish this) but  I remember the 
words lets start simple. 
Again since  details are only queried 2 times for a bit more then 10 hits i 
prefer query it again instead of caching.

Host _grouping_ for all hits is difficult since it would require to scan all 
hits to just show the first page. 
I suggest to add another method that filter hosts. Then with the new site:
url query syntax it can just marked with something like <more hits from 
this host>

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-14 00:37

Message:
Logged In: YES 
user_id=396197

Andrzej,
thanks for your comments!
I will fix the problem with LinkedList.
You misunderstood something. You only need to call the method one
time. I just left the old method in for compatibility reasons.
Right, I need to get the details for all hits from 0-n, where n is the last hit
on the current page, but then I only get the details for m-n, where m
is the first hit on the current page. So _of course_ you don't need to get
the details for all hits twice. I was thinking about caching details as
well, but after some tests I decided against it, since in case you browse,
let's say, page 20, you would need to cache
20 pages * (10 hitsPerPage + doubledHostHits). Remember that we plan to
provide a system that can handle a set of users per minute, so I think just
reading 10 + doubledHitsPerHost a second time is fair enough.
By the way, I think the other search engines do it the same way; I
spent some time analyzing that.

Doug,
>  I think the best way
>to do this is to track, with each page of hits, the last Hit
>that is displayed.
I was trying to add a pointer to the last hit to the search.jsp page, but ran
into trouble when calculating the page numbers. The only chance I see is to use a
set of URL parameters like lastPage, lastHit etc. I don't think that is a good
idea.
Since I'm sure g**gle does it very similarly, I did it this way. I spent a lot of
time on different solutions, but the suggested one is at least not
too complex.

Finally, you already suggested using URL, and of course I tested it.
The result:
url: 726
string: 18
So URL is 40 times slower. ;-o
My test code:
public static void main(String[] args) throws MalformedURLException {
    String urlStr = "http://sourceforge.net/tracker/?func=detail&atid=491356&aid=989844&group_id=59548";
    long l = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        URL url2 = new URL(urlStr);
        String host = url2.getHost();
    }
    System.out.println("url: " + (System.currentTimeMillis() - l));
    final int ignoreStart = "http://".length();
    long t = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        int firstSlash = urlStr.indexOf("/", ignoreStart + 1);
        if (firstSlash < 1) {
            firstSlash = urlStr.length();
        }
        String host2 = urlStr.substring(ignoreStart, firstSlash);
    }
    System.out.println("string: " + (System.currentTimeMillis() - t));
}

Might I be overlooking something?



----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-13 18:48

Message:
Logged In: YES 
user_id=21778

I agree with Andrzej's comments.  I don't think we should
commit this until these are addressed.  Andrzej, do you have
time to work on this issue?

We must be very sparing of fetching hitDetails(), reading
only those we absolutely require, and reading them only
once.  The patch currently accesses these twice.  In large
deployments this can substantially impact performance.

We must also, when viewing subsequent pages, avoid accessing
details of hits from previous pages.  I think the best way
to do this is to track, with each page of hits, the last Hit
that is displayed, so that, when viewing the next page we
can start accessing details from that point.

Constructing the correct, de-duplicated results for
subsequent pages is tricky.  I think the correct behavior is
that a site which appeared on the first page can also appear
on the second page, so long as none of the hits shown for it
on the first page were shown on the second page.  I don't
think the current patch implements this.

Some discussion of these issues can be found in the thread:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg01317.html

Finally, to extract the host from a URL string, I think we
should construct a java.net.URL and then access its host field.

----------------------------------------------------------------------

Comment By: Andrzej Bialecki (abial)
Date: 2004-07-13 12:52

Message:
Logged In: YES 
user_id=32200

First of all, I think it's a good patch - I vote yes, with
some reservations explained below.

In NutchBean.search(): I'm not sure there is any point in
calling searcher.search() twice. If the original numHitsRaw was
too large, you won't get any different results when you
decrease it, you'll just get fewer results... the value you
pass in numHitsRaw is a "best effort" target anyway, i.e.
the searcher will try to return as many top hits as this
target, or fewer if fewer are available.

It is unfortunate that you have to get the HitDetails at
this stage (it's costly, especially when distributed), but
it can't be avoided if you want to know the host name.
However, I think that then instead of discarding this
valuable HitDetails data, you should stick it into HostHit
either together with the Hit, or instead of it (HitDetails
already contains segment and doc numbers, you can store the
score in a separate ArrayList). This way you can later get
the details directly from individual
HostHit.getDetails(index)/HostHit.getScore(index), without
calling the bean again and transmitting again the same
HitDetails from segments and possibly from other servers.

HostHit.countHostHits() is a misnomer - I suggest changing
it to simply getSize() or getCount(), or getLength().

I noticed you like to use LinkedList - I strongly suggest
using ArrayList instead in all places in this patch. For
pure "add(Object)" and "get(index)" operations ArrayList is
always much faster (~2 times) than LinkedList. This is even
more true for toArray() operation.
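The ArrayList-vs-LinkedList claim is easy to sanity-check with a quick, unscientific timing sketch (class and method names are mine; absolute numbers will vary by machine):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListBench {
    // Time add(Object) followed by get(index) over the whole list,
    // for whichever List implementation is passed in.
    static long time(List list, int n) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            list.add(Integer.valueOf(i));
        }
        for (int i = 0; i < n; i++) {
            list.get(i); // O(1) for ArrayList, O(n) traversal for LinkedList
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int n = 20000;
        System.out.println("ArrayList:  " + time(new ArrayList(), n) + " ms");
        System.out.println("LinkedList: " + time(new LinkedList(), n) + " ms");
    }
}
```

Note that indexed get() on a LinkedList is O(n) per call, so a get-by-index loop exaggerates the gap well beyond the ~2x seen for pure add() and toArray().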

I suggest renaming the property from
"search.page.raw.hits.factor" to
"searcher.hostgrouping.rawhits.factor", to be more
consistent with naming of other properties.

Class-level and method-level comments contain some cruft -
missing javadoc information (to be completed), or foreign
RCS revision numbers (this needs to be removed, or formatted
in a way that prevents automatic expansion in our CVS).

----------------------------------------------------------------------

Comment By: holman (ericholman)
Date: 2004-07-13 05:54

Message:
Logged In: YES 
user_id=1015664

I vote for it also. However, I would ultimately like to see
grouping across the entire result set, rather than just per
page.

----------------------------------------------------------------------



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
