Bugs item #989844, was opened at 2004-07-12 20:32
Message generated for change (Comment added) made by ericholman
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=989844&group_id=59548

Category: searcher
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Stefan Groschupf (joa23)
Assigned to: Doug Cutting (cutting)
Summary: host grouping

Initial Comment:
This patch groups hits by host on each result page.

If you like it, _please_ vote for it. ;)

The hits of one host are grouped _per page_ (not over the 
complete result set) in a <code>HostHit</code> object.
A HostHit object contains one or more Hit objects.

The patch provides a new API alongside the old API:
HostHits = NutchBean.search(Query query, int numHits, int 
hitsPerPage)

This patch allows you to realize different scenarios, such as:

+ show only one Hit from a host per page
+ show all hits from a host below the hit with the highest score 
and indent them, similar to how Google does it
+ show one hit per host and list the URLs of the other host hits 
below it
+ allow users to switch host grouping on or off
+ much more

Whatever you wish to do, you realize it in the JSP page 
with the new method call, using HostHits, HostHit and Hit.

Some code snippets to give an idea of how you can do that:

HostHits hits = bean.search(query, start + hitsPerPage, hitsPerPage);
HostHit[] show = hits.getHostHits(start, length);
...
if (hits.getTotal() <= start) {
    start = (int) (hits.getTotal() / hitsPerPage - 0.49);
}
...
Hit mainHit = show[i].getHit(0);
HitDetails detail = bean.getDetails(mainHit);
String title = detail.getValue("title");
String url = detail.getValue("url");
String summary = bean.getSummary(detail, query);
...
int hostHitsCount = show[i].getHits().length;
if (hostHitsCount > 1) {
    for (int j = 1; j < hostHitsCount; j++) {
        HitDetails hostHitDetail = bean.getDetails(show[i].getHit(j));
        String hostHitUrl = hostHitDetail.getValue("url");
        ...
    }
}

You need to add this to nutch-default.xml as well.

<property>
  <name>search.page.raw.hits.factor</name>
  <value>2</value>
  <description>
  A factor that is used to determine the number of raw hits
  initially fetched, before host grouping is done.
  </description>
</property>
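For illustration, here is a minimal sketch of how such a factor might be applied when deciding how many raw hits to fetch before grouping. The class and method names below are hypothetical, not from the patch; only the default value 2 comes from the property above.

```java
// Hypothetical sketch (names are illustrative, not from the patch) of how
// the search.page.raw.hits.factor property could be applied: fetch
// factor * numHits raw hits, so that after collapsing hits per host there
// are still enough left to fill a page.
class RawHitsFactor {
    // mirrors the <value>2</value> default in nutch-default.xml
    static final int RAW_HITS_FACTOR = 2;

    static int rawHitsToFetch(int numHits) {
        return numHits * RAW_HITS_FACTOR;
    }

    public static void main(String[] args) {
        // a 10-hit page would fetch 20 raw hits before grouping
        System.out.println(rawHitsToFetch(10));
    }
}
```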


----------------------------------------------------------------------

Comment By: holman (ericholman)
Date: 2004-07-14 20:41

Message:
Logged In: YES 
user_id=1015664

>>> I disagree with Doug that we should use URL to get the 
>>> host name - the URL pattern for host name is pretty 
>>> straightforward, and we can do it manually. 


>> Right now Nutch is able to run as an exclusive (niche) 
>> search engine (which is where many opportunities exist in 
>> marketplace). However, if only pattern matching as you 
>> describe, will it not see these as the same host:
>> http://www.isp.com/myname
>> http://www.isp.com/yourname

>> These may actually be 2 separate entries (or host names)
>> in an exclusive/niche engine, and wouldn't want them 
>> grouped together.


> That's true - you mean the virtual hosts... However, even
> when you use URL class it doesn't help you. One way out of
> it would be to provide a config property containing a
> java.util.regex.Pattern expression to catch the host name...


Stefan, Is this going to be part of the patch? I didn't see any 
more discussion on it.

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-14 20:02

Message:
Logged In: YES 
user_id=396197

Thanks for your comments; they are very helpful! That is the only way to 
improve something.

Just compare the lines of code and the lines of comments. ;)
My fault, at least; I will work hard to communicate more clearly. That is a 
known bug in my hard- and software.
>You could probably keep the pointer to the last "used up"
>hit in the user's session
I don't think that is a good idea, since I have a lot of bad experience 
with session variables. First, you need to synchronize them if you 
load-balance Tomcat. Next, they are slow. Next, they are mostly cached 
much longer than you need them, etc.

Google does a kind of host filtering: it shows you host hit 1 and host 
hit 2, filters any other hits from the same host, but adds a search for 
hits on that host.

javadoc:
>After importing this into Nutch CVS this will get expanded
>to different values
I know, that is the point of these tags, isn't it? I really miss this kind 
of header in the Nutch source code, since all OS projects I was involved in 
had such headers.
I will improve the javadoc comments and rename the methods. You will get 
the new patch version soon. And just as a note: I will improve it as many 
times as necessary to get this patch into the CVS.  ;-)

After that I will continue with the JMX stuff.
B-)

 

----------------------------------------------------------------------

Comment By: Andrzej Bialecki (abial)
Date: 2004-07-14 12:22

Message:
Logged In: YES 
user_id=32200

Hi Stefan,

>It looks like we still miss understand each other in a set of
>technical issues. 
>>Please note that this invokes searcher.search() exactly once
>>per page, and retrieves only the HitDetails for the hits on
>>the current page.
>
>Yes that is exactly what you need to do with the new api as well.
>From my new jsp page:
> HostHits hits = bean.search(query, start+hitsPerPage, hitsPerPage);
>The rest is still the same. So you only invoke this once per page and only
>get the details including the summary one time per hit and page!!!

I know, I didn't dispute this - I meant that within the
NutchBean.searchGroupedByPage() you may invoke
IndexSearcher.search() more than once. But see below...

>You are wrong! I group host per page but need every-time to recalculate
>the grouping of all previous pages since you cant be that you miss to
>show hits if you work with a offset url parameter.

Ok, now it's clear - my mistake, I misunderstood the reason
why you do it.

>You can't only analyze the hits for this page,
>since you do not know where the page start, where it ends and on witch
>page you are.

Interesting... Now I'm really curious how others do this.
:-) Google, from what I can see, groups pages in the whole
result set, not just per page, but I doubt they store the
intermediate grouped result - so maybe you are right (you
can never be sure, though, without looking at their code...
;-) )

You could probably keep the pointer to the last "used up"
hit in the user's session, or pass it through the query
string, but I agree this is a bit ugly (however, I can tell
you that I see it in a couple of commercial intranet search
engines).

>Please tell me how you know where the page does start
>pageNum*hitsPerPage will give a absolutly wrong value since page 1 will
>show you 10 hits from different hosts but 23 hits at all. Do
>you get the problem?

Now I do - you are right, sorry to be so dense... All of
this would look very different if only you could store
the latest "used up" pointer persistently...

Regarding the caching of HitDetails - you don't seem to
understand my point either, but we can leave this subject as
a possible future optimization, and do it the way you do it
now. Ok?

>Can you please more specific about what you mean with the javadoc
>comments?

E.g.:

 * created on ${date}

 * @author $Author: sg $ (last edit)
 * @version $Revision: 1.2 $

After importing this into Nutch CVS this will get expanded
to different values - and I'm not sure even if CVS won't
hiccup on the revision number... No other file uses these
RCS macros, so I guess they shouldn't be added here...

The patch is missing javadoc for many methods, and missing
@param descriptions in many other places.

>>The new method should run searcher.search() just once
>As mentioned it does only runs this method one time!!!

No need to yell - the searcher.search() method within
searchGroupedByPage() WILL run more than once if the
rawHitFactor is too small. ;-)

Regarding the method name: you make it sound as if I'm
bothering you with trivialities. Maybe I am... - I was
merely pointing out that this sort of method normally has a
more logical name, like size() - the unusual name you gave
it makes it sound as if it was doing something which it
doesn't. It's better to correct it now, than do it later
(like I'm going to propose with ExtensionPoint.getExtentens()).

Please don't take my comments as negative - I'm trying to
figure out why you did it this way, and if at this moment
there is no other more elegant way to do it, this patch at
least works and can be fine-tuned later.


----------------------------------------------------------------------

Comment By: Andrzej Bialecki (abial)
Date: 2004-07-14 10:32

Message:
Logged In: YES 
user_id=32200

That's true - you mean the virtual hosts... However, even
when you use URL class it doesn't help you. One way out of
it would be to provide a config property containing a
java.util.regex.Pattern expression to catch the host name...
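As a sketch of that idea: a java.util.regex.Pattern (which would come from the proposed config property) whose first group captures the grouping key. The example pattern below keeps the first path segment, so virtual hosts like www.isp.com/myname and www.isp.com/yourname group separately. All names here, and the pattern itself, are illustrative assumptions, not part of the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the config-driven approach: a regex whose first
// group is the "host" key used for grouping. This example keeps the first
// path segment, so virtual hosts on the same domain stay separate.
class HostKeyExtractor {
    // scheme :// host [ / first-path-segment ]
    static final Pattern HOST_KEY =
            Pattern.compile("^\\w+://([^/]+(?:/[^/?#]+)?)");

    static String hostKey(String url) {
        Matcher m = HOST_KEY.matcher(url);
        return m.find() ? m.group(1) : url;  // fall back to the raw URL
    }

    public static void main(String[] args) {
        System.out.println(hostKey("http://www.isp.com/myname/page.html"));
        System.out.println(hostKey("http://www.isp.com/yourname"));
    }
}
```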

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-14 10:30

Message:
Logged In: YES 
user_id=396197

It looks like we still misunderstand each other on a set of technical 
issues.
>Please note that this invokes searcher.search() exactly once
>per page, and retrieves only the HitDetails for the hits on
>the current page.

Yes, that is exactly what you need to do with the new API as well.
From my new jsp page:
 HostHits hits = bean.search(query, start+hitsPerPage, hitsPerPage);
The rest is still the same. So you only invoke this once per page, and 
only get the details, including the summary, one time per hit and page!
>Your current patch tries to implement host grouping across
>all pages displayed so far.
You are wrong! I group hosts per page, but every time I need to 
recalculate the grouping of all previous pages, since you can't risk 
missing hits when you work with an offset URL parameter.

I really spent a lot of time finding a solution; by the way, I'm sure 
Google does it exactly the same way. You can't analyze only the hits for 
this page, since you do not know where the page starts, where it ends, and 
on which page you are. The only chance is to use a unique document id that 
does not exist (see my mail), or to add a set of URL parameters - which I 
think is not a good solution. Do you see such parameters in other search 
engines?

>With this assumption, I suggest the following changes (since
>the users will have to use different API anyway):
>* change the semantics of the search method from:
>     searchGroupedByPage(query, numHits, hitsPerPage)
>  to
>     searchGroupedByPage(query, pageNum, hitsPerPage)

Please tell me how you know where the page starts. pageNum*hitsPerPage 
will give an absolutely wrong value, since page 1 will show you 10 hits 
from different hosts but 23 hits in total. Do you see the problem?
>The new method should run searcher.search() just once
As mentioned, it runs this method only one time!

The search method runs only one time. The if clause there is for the case 
that your raw hit factor was too small; if this happens too often (maybe 
if you run a company search), you need to edit nutch-default.xml.

>more efficient bulk operation, to retrieve HitDetails
>for these Hit[].

I don't understand you: the bulk method also only loops over Hit[], and if 
I used that method I might query too many details. Actually I only query 
the details I need in order to group the host hits for this specific 
page.

> Now it pays off to cache the HitDetails inside each HostHit,
You are again wrong: a HostHit has many hits, so it requires caching hit 
details, not host-hit details.

> So, in the end, the end-user API example would look like this:
.. ;-/ Again, the hit details would need to be cached in the Hit object, 
and then you cache a lot of details you will not show. For example, my UI 
only shows that there are more hits on this host. If you wished to show 
all hits of the page, then you might have 34 hits on one page and not 10.

>What do you think?
I think I improved the code 4 times before submitting the patch, and you 
will run into the same problems if you re-implement it.

Can you please be more specific about what you mean with the javadoc 
comments?
I can change the method names, since it looks like you find that 
important.


----------------------------------------------------------------------

Comment By: holman (ericholman)
Date: 2004-07-14 09:56

Message:
Logged In: YES 
user_id=1015664

> I disagree with Doug that we should use URL to get the 
> host name - the URL pattern for host name is pretty 
> straightforward, and we can do it manually. 

Right now Nutch is able to run as an exclusive (niche) search 
engine (which is where many opportunities exist in the 
marketplace). However, with only the pattern matching you 
describe, will it not see these as the same host:
http://www.isp.com/myname
http://www.isp.com/yourname

These may actually be 2 separate entries (or host names) in 
an exclusive/niche engine, and wouldn't want them grouped 
together.

Eric

----------------------------------------------------------------------

Comment By: Andrzej Bialecki (abial)
Date: 2004-07-14 05:40

Message:
Logged In: YES 
user_id=32200

The more I look at this patch the more I think that this
should be done in a different way.

Normally you would do the following to display a page of
results:

   Hits hits = bean.search(query, start + hitsPerPage);
   int end = (int) Math.min(hits.getTotal(), start + hitsPerPage);
   int length = end - start;
   Hit[] show = hits.getHits(start, length);
   HitDetails[] details = bean.getDetails(show);
   String[] summaries = bean.getSummary(details, query);

(stolen from search.jsp)

Please note that this invokes searcher.search() exactly once
per page, and retrieves only the HitDetails for the hits on
the current page.

It would be best to keep it that way as much as possible
with the grouping code, to maintain similar performance.
Your current patch tries to implement host grouping across
all pages displayed so far (which is I think a laudable
goal), but I thought that "simple" meant that initially we
try to group only per display page... ;-)

With this assumption, I suggest the following changes (since
the users will have to use different API anyway):

* change the semantics of the search method from:
     searchGroupedByPage(query, numHits, hitsPerPage)
  to
     searchGroupedByPage(query, pageNum, hitsPerPage)

The new method should run searcher.search() just once, with
the appropriately calculated total number of hits so that
you get the new results at the tail, to use them for the
current display page grouping. And then, instead of
retrieving individual hits one by one you should get just
the Hit[] range, starting from the latest start position
(equal, I think, to pageNum * rawHitsFactor), and
hitsPerPage * rawHitsFactor long. This range is needed for
the grouping of the current display page only. Use also the
same, more efficient bulk operation, to retrieve HitDetails
for these Hit[].

Now it pays off to cache the HitDetails inside each HostHit,
because you only store a maximum of (rawHitsFactor *
hitsPerPage) of HitDetails - an acceptable compromise to
avoid getting the same HitDetails twice.

So, in the end, the end-user API example would look like this:

HostHits hits = bean.searchGroupedByPage(query, pageNum, hitsPerPage);
HostHit[] show = hits.getHostHits(); // start and length already set
                                     // by the search method
for (int i = 0; i < show.length; i++) {
    Hit hit = show[i].getHit(0);
    HitDetails details = show[i].getDetails(0); // already cached
    String title = details.getValue("title");
    String url = details.getValue("url");
...

and so on, pretty much as in your original example.

What do you think?

Also, I hate to sound picky, but you still didn't clean up
and fill in the javadoc comments in this patch, and also
IMHO you should rename the method countHostHits() to size(),
length(), getSize(), or somesuch... it doesn't really count
anything, it just returns the size. FWIW, I disagree with
Doug that we should use URL to get the host name - the URL
pattern for host name is pretty straightforward, and we can
do it manually. However, it's not just "http://" as you
assume - with all the plugins we support now it could be
anything (e.g. "smb://" ;-) ). So, I think you should just
find the first "://" and then continue until the first "/".
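A minimal sketch of that scheme-agnostic rule (the class and method names below are illustrative, not from the patch): find the first "://", then take everything up to the next "/".

```java
// Minimal sketch of the extraction rule described above: find the first
// "://", then take everything up to the next "/". Works for "http://",
// "smb://", or any other scheme a plugin might introduce.
class HostFromUrl {
    static String host(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url;                    // no scheme marker: return as-is
        }
        int hostStart = schemeEnd + 3;     // skip past "://"
        int hostEnd = url.indexOf('/', hostStart);
        return hostEnd < 0 ? url.substring(hostStart)
                           : url.substring(hostStart, hostEnd);
    }

    public static void main(String[] args) {
        System.out.println(host("smb://fileserver/share/doc.txt"));
        System.out.println(host("http://www.example.com/page.html"));
    }
}
```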

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-13 19:39

Message:
Logged In: YES 
user_id=396197

Thanks for the comments!
Please find attached an improved version of the grouping patch.
I use ArrayList now.
I now retrieve the host via URL().getHost().

Caching the details would require changing Doug's Hit.java object. It 
could easily be done (let me know if you wish this), but I remember the 
words "let's start simple".
Again, since the details are only queried twice for a bit more than 10 
hits, I prefer to query them again instead of caching.

Host _grouping_ over all hits is difficult, since it would require 
scanning all hits just to show the first page.
I suggest adding another method that filters hosts. Then, with the new 
site: URL query syntax, it can just be marked with something like <more 
hits from this host>.

----------------------------------------------------------------------

Comment By: Stefan Groschupf (joa23)
Date: 2004-07-13 17:37

Message:
Logged In: YES 
user_id=396197

Andrzej,
thanks for your comments!
I will fix the LinkedList problem.
You misunderstand something: you only need to call the method one time. I 
just leave the old method in for compatibility reasons.
Right, I need to get the details for all hits from 0-n, where n is the 
last hit on the current page, but then I only get the details for m-n, 
where m is the first hit on the current page. So _of course_ you don't 
need to get the details for all hits twice. I was thinking about caching 
details as well, but after some tests I decided against it, since if you 
browse, let's say, page 20, you need to cache 
20 pages * (10 hitsPerPage + doubledHostHits). Remember that we plan to 
provide a system that can handle a set of users per minute, so I think 
just reading 10 + doubledHitsPerHost a second time is fair enough.
By the way, I think the other search engines do it the same way; I spent 
some time analyzing that.

Doug,
>  I think the best way
>to do this is to track, with each page of hits, the last Hit
>that is displayed.
I was trying to add a pointer to the last hit in the search.jsp page, but 
ran into trouble when calculating the page numbers. The only chance I see 
is to use a set of URL parameters like lastPage, lastHit, etc. I don't 
think that is a good idea.
Since I'm sure g**gle does it very similarly, that is how I did it. I 
spent a lot of time on different solutions, but the suggested one is at 
least the one that is not too complex.

Finally, you already suggested using URL, and of course I had tested it.
The result:
url: 726
string: 18
So URL is 40 times slower. ;-o
My test code:
 public static void main(String[] args) throws MalformedURLException {
        String urlStr = "http://sourceforge.net/tracker/?func=detail&atid=491356&aid=989844&group_id=59548";
        long l = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            URL url2 = new URL(urlStr);
            String host = url2.getHost();
        }
        System.out.println("url: " + (System.currentTimeMillis() - l));
        final int ignoreStart = "http://".length();
        long t = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            int firstSlash = urlStr.indexOf("/", ignoreStart + 1);
            if (firstSlash < 1) {
                firstSlash = urlStr.length();
            }
            String host2 = urlStr.substring(ignoreStart, firstSlash);
        }
        System.out.println("string: " + (System.currentTimeMillis() - t));
    }

Did I maybe overlook something?



----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-13 11:48

Message:
Logged In: YES 
user_id=21778

I agree with Andrzej's comments.  I don't think we should
commit this until these are addressed.  Andrzej, do you have
time to work on this issue?

We must be very sparing of fetching hitDetails(), reading
only those we absolutely require, and reading them only
once.  The patch currently accesses these twice.  In large
deployments this can substantially impact performance.

We must also, when viewing subsequent pages, avoid accessing
details of hits from previous pages.  I think the best way
to do this is to track, with each page of hits, the last Hit
that is displayed, so that, when viewing the next page we
can start accessing details from that point.

Constructing the correct, de-duplicated results for
subsequent pages is tricky.  I think the correct behavior is
that a site which appeared on the first page can also appear
on the second page, so long as none of the hits shown for it
on the first page were shown on the second page.  I don't
think the current patch implements this.

Some discussion of these issues can be found in the thread:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg01317.html

Finally, to extract the host from a URL string, I think we
should construct a java.net.URL and then access its host field.
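Doug's suggestion in a minimal, runnable form (the wrapper class and method names are illustrative; only java.net.URL and its getHost() accessor are the actual API in question):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Construct a java.net.URL from the string and read its host.
// Class and method names here are illustrative wrappers.
class UrlHost {
    static String host(String urlStr) {
        try {
            return new URL(urlStr).getHost();
        } catch (MalformedURLException e) {
            return null;  // not a parseable URL
        }
    }

    public static void main(String[] args) {
        System.out.println(host("http://sourceforge.net/tracker/"));
    }
}
```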

----------------------------------------------------------------------

Comment By: Andrzej Bialecki (abial)
Date: 2004-07-13 05:52

Message:
Logged In: YES 
user_id=32200

First of all, I think it's a good patch - I vote yes, with
some reservations explained below.

In NutchBean.search(): I'm not sure if there is any point to
call searcher.search() twice. If the original numHitsRaw was
too large, you won't get any different results when you
decrease it, you'll just get less results... the value you
pass in numHitsRaw is a "best effort" target anyway, i.e.
the searcher will try to return as many top hits as this
target, or less if less is available.

It is unfortunate that you have to get the HitDetails at
this stage (it's costly, especially when distributed), but
it can't be avoided if you want to know the host name.
However, I think that then instead of discarding this
valuable HitDetails data, you should stick it into HostHit
either together with the Hit, or instead of it (HitDetails
already contains segment and doc numbers, you can store the
score in a separate ArrayList). This way you can later get
the details directly from individual
HostHit.getDetails(index)/HostHit.getScore(index), without
calling the bean again and transmitting again the same
HitDetails from segments and possibly from other servers.

HostHit.countHostHits() is a misnomer - I suggest changing
it to simply getSize() or getCount(), or getLength().

I noticed you like to use LinkedList - I strongly suggest
using ArrayList instead in all places in this patch. For
pure "add(Object)" and "get(index)" operations ArrayList is
always much faster (~2 times) than LinkedList. This is even
more true for toArray() operation.
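A rough micro-benchmark sketch of that claim (all names illustrative; absolute timings are machine-dependent): LinkedList.get(i) walks the chain from an end, which is O(n), while ArrayList indexes an array in O(1), so for add(Object) plus get(index) ArrayList should come out ahead.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustrative sketch comparing add(Object) + get(index) on the two
// List implementations. Timings printed are only indicative.
class ListBench {
    // fills the list with 0..n-1, then reads it back by index;
    // returns the sum as a correctness check and prints the elapsed time
    static long fillAndScan(List<Integer> list, int n) {
        long t = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += list.get(i);
        }
        System.out.println(list.getClass().getSimpleName() + ": "
                + (System.currentTimeMillis() - t) + " ms");
        return sum;
    }

    public static void main(String[] args) {
        int n = 20000;
        fillAndScan(new ArrayList<Integer>(), n);
        fillAndScan(new LinkedList<Integer>(), n);
    }
}
```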

I suggest renaming the property from
"search.page.raw.hits.factor" to
"searcher.hostgrouping.rawhits.factor", to be more
consistent with naming of other properties.

Class-level and method-level comments contain some cruft -
missing javadoc information (to be completed), or foreign
RCS revision numbers (this needs to be removed, or formatted
in a way that prevents automatic expansion in our CVS).

----------------------------------------------------------------------

Comment By: holman (ericholman)
Date: 2004-07-12 22:54

Message:
Logged In: YES 
user_id=1015664

I vote for it also. However, would ultimately like to see 
grouping across the entire result set, rather than just per 
page.

----------------------------------------------------------------------



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
