Hi nutch users,

as discussed, here is a patch that allows host grouping and/or filtering.
The hits of one host are grouped _per page_ (not over the complete result set!) in a <code>HostHit</code> object.
A HostHit object always holds at least one Hit object (1+n).

The patch provides a new API alongside the old one:
HostHits NutchBean.search(Query query, int numHits, int hitsPerPage)
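
To make the per-page grouping concrete, here is a minimal, self-contained sketch in plain Java. It is not the patch code: the class PerPageHostGrouping and its group method are hypothetical, hits are represented only by their host names in ranking order, and pages are counted over the raw (ungrouped) hit positions. Unlike the patch, it groups the whole input list instead of stopping once enough groups are collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the grouping idea; not part of Nutch.
final class PerPageHostGrouping {

    /**
     * Groups raw hit positions by host, separately for each page.
     * Each inner list is one group; its first entry is the best-ranked hit.
     */
    static List<List<Integer>> group(List<String> rankedHosts, int hitsPerPage) {
        List<List<Integer>> groups = new ArrayList<>();
        Map<String, List<Integer>> groupByHostAndPage = new HashMap<>();
        for (int i = 0; i < rankedHosts.size(); i++) {
            int page = i / hitsPerPage;                      // page of the raw hit
            String key = rankedHosts.get(i) + "#" + page;   // host + page, as in the patch
            List<Integer> group = groupByHostAndPage.get(key);
            if (group == null) {
                // first hit of this host on this page: open a new group
                group = new ArrayList<>();
                groupByHostAndPage.put(key, group);
                groups.add(group);
            }
            group.add(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("a.org", "b.org", "a.org", "a.org", "c.org", "a.org");
        System.out.println(group(hosts, 3)); // [[0, 2], [1], [3, 5], [4]]
    }
}
```

With three hits per page, a.org gets two groups: one for the first page (positions 0 and 2) and one for the second (positions 3 and 5).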

This patch allows you to realize different scenarios, such as:

+ show only one hit from a host per page
+ show all hits from a host below the hit with the highest score and indent them, similar to what Google does
+ show one hit per host and show the URLs of the other hits from that host below it
+ allow users to switch host grouping on or off
+ and much more

Whatever you wish to do, you need to implement it in the JSP page, using the new method call and the HostHits, HostHit and Hit classes.

Some code snippets to give you an idea of how to do that:

    HostHits hits = bean.search(query, start + hitsPerPage, hitsPerPage);
    HostHit[] show = hits.getHostHits(start, length);
    ...
    if (hits.getTotal() <= start) {
        start = (int) (hits.getTotal() / hitsPerPage - 0.49);
    }
    ...
    // the first hit of a group is the main hit of that host
    Hit mainHit = show[i].getHit(0);
    HitDetails detail = bean.getDetails(mainHit);
    String title = detail.getValue("title");
    String url = detail.getValue("url");
    String summary = bean.getSummary(detail, query);
    ...

    // the remaining hits of the same host
    int hostHitsCount = show[i].getHits().length;
    if (hostHitsCount > 1) {
        for (int j = 1; j < hostHitsCount; j++) {
            HitDetails hostHitDetail = bean.getDetails(show[i].getHit(j));
            String hostHitUrl = hostHitDetail.getValue("url");
            ...
        }
    }


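If start is meant to be a zero-based hit offset (an assumption, the snippet above does not say so), one way to pull an out-of-range start back to the first hit of the last page could look like this. PageStart and clampStart are hypothetical names and this is a sketch, not what the snippet above computes:

```java
// Hypothetical helper; not part of the patch.
final class PageStart {

    /**
     * Returns start unchanged if it points at an existing hit, otherwise
     * the offset of the first hit on the last page (0 if there are no hits).
     */
    static int clampStart(int start, long total, int hitsPerPage) {
        if (total <= 0) {
            return 0;
        }
        if (start < total) {
            return start;
        }
        // integer division rounds down to the index of the last page
        long lastPageStart = ((total - 1) / hitsPerPage) * hitsPerPage;
        return (int) lastPageStart;
    }

    public static void main(String[] args) {
        System.out.println(clampStart(30, 25, 10)); // 20
    }
}
```
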

Index: NutchBean.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/searcher/NutchBean.java,v
retrieving revision 1.8
diff -u -r1.8 NutchBean.java
--- NutchBean.java      4 May 2004 17:20:22 -0000       1.8
+++ NutchBean.java      4 Jul 2004 22:51:50 -0000
@@ -6,8 +6,11 @@
 import java.io.IOException;
 import java.io.File;
 
-import java.net.InetSocketAddress;
 
+
+import java.util.HashMap;
+
+import java.util.LinkedList;
 import java.util.Vector;
 import net.nutch.indexer.IndexSegment;
 import java.util.logging.Logger;
@@ -36,6 +39,8 @@
   private HitSummarizer summarizer;
   private HitContent content;
 
+  private int rawHitsFactor;
+
   /** Cache in servlet context. */
   public static NutchBean get(ServletContext app) throws IOException {
     NutchBean bean = (NutchBean)app.getAttribute("nutchBean");
@@ -96,6 +101,7 @@
     this.detailer = indexSearcher;
     this.summarizer = segments;
     this.content = segments;
+    this.rawHitsFactor = NutchConf.getInt("search.page.raw.hits.factor", 2);
   }
 
   private void init(DistributedSearch.Client client) throws IOException {
@@ -114,6 +120,77 @@
   public Hits search(Query query, int numHits) throws IOException {
     return searcher.search(query, numHits);
   }
+  
+  /**
+   * Groups hits from the same host per page and returns a <code>HostHits</code>
+   * object to access them.
+   * 
+   * @param query
+   * @param numHits
+   * @param hitsPerPage
+   * @return Returns HostHits that holds a set of HostHit objects
+   * @throws IOException
+   */
+    public HostHits search(Query query, int numHits, int hitsPerPage)
+            throws IOException {
+
+        int numHitsRaw = numHits + (hitsPerPage * rawHitsFactor);
+        Hits hits = searcher.search(query, numHitsRaw);
+        LinkedList mergedHitsList = new LinkedList();
+        HashMap existingHosts = new HashMap();
+
+        int maxLoops = (int) Math.min(numHits, hits.getTotal());
+        int i = 0;
+        int page = 0;
+        int groupedHits = 0;
+        while (mergedHitsList.size() < maxLoops) {
+            if (i % hitsPerPage == 0) {
+                page = page + 1;
+            }
+            if (hits.getTotal() <= i) {
+                break;
+            }
+            if (hits.getLength() <= i) {
+                numHitsRaw = i + (hitsPerPage * rawHitsFactor);
+                hits = searcher.search(query, numHitsRaw);
+            }
+
+            Hit hit = hits.getHit(i);
+            String hostPagekey = getHost(hit) + page;
+
+            if (existingHosts.containsKey(hostPagekey)) {
+                ((HostHit) existingHosts.get(hostPagekey)).addHit(hit);
+                i++;
+                groupedHits = groupedHits + 1;
+                continue;
+            }
+            HostHit hostHit = new HostHit(hit);
+            existingHosts.put(hostPagekey, hostHit);
+            mergedHitsList.add(hostHit);
+            i++;
+        }
+
+        return new HostHits((HostHit[]) mergedHitsList
+                .toArray(new HostHit[mergedHitsList.size()]),
+                (hits.getTotal() - groupedHits));
+    }
+  
+  /**
+   * 
+   * @param hit
+   * @return Returns the host of a hit.
+   * @throws IOException
+   */
+  private String getHost(Hit hit) throws IOException {
+      final int ignoreStart = "http://".length();
+      HitDetails detail = getDetails(hit);
+      String host = detail.getValue("url");
+      int firstSlash = host.indexOf("/", ignoreStart + 1);
+      if (firstSlash < 1) {
+          firstSlash = host.length();
+      }
+      return host.substring(ignoreStart, firstSlash);
+  }
 
   public String getExplanation(Query query, Hit hit) throws IOException {
     return searcher.getExplanation(query, hit);
Index: HostHit.java
===================================================================
RCS file: HostHit.java
diff -N HostHit.java
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ HostHit.java        1 Jan 1970 00:00:00 -0000
@@ -0,0 +1,54 @@
+/* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
+/* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
+
+package net.nutch.searcher;
+
+import java.util.LinkedList;
+
+/**
+ * created on ${date}
+ * 
+ * @author Stefan Groschupf
+ * @author $Author: sg $ (last edit)
+ * @version $Revision: 1.2 $
+ * 
+ * A data object that holds all hits of one host per page.
+ */
+public class HostHit {
+
+    private LinkedList fHits;
+
+    /**
+     * @param hit
+     */
+    public HostHit(Hit hit) {
+        fHits = new LinkedList();
+        fHits.add(hit);
+    }
+
+    /**
+     * @param hit
+     */
+    public void addHit(Hit hit) {
+        fHits.add(hit);
+    }
+
+    public Hit getHit(int i) {
+        return (Hit) fHits.get(i);
+    }
+
+    /**
+     * @return Returns the number of hits in this host
+     */
+    public int countHostHits() {
+        return fHits.size();
+    }
+
+    /**
+     * @return Returns the hits of a host.
+     */
+    public Hit[] getHits() {
+        return (Hit[]) fHits.toArray(new Hit[fHits.size()]);
+    }
+
+}
Index: HostHits.java
===================================================================
RCS file: HostHits.java
diff -N HostHits.java
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ HostHits.java       1 Jan 1970 00:00:00 -0000
@@ -0,0 +1,84 @@
+/*
+ * Copyright (c) 2003 by media style GmbH
+ * 
+ * @author Stefan Groschupf $Id: PrintCom_codetemplates.xml,v 1.1 2004/01/27
+ * 11:19:10 pcomp134 Exp Configuration.java,v 1.0 12.06.2004 21:53:42 Stefan
+ * Groschupf Exp $ $Source:
+ * /cvsroot/open2/gui/documentation/templates/PrintCom_codetemplates.xml,v
+ * com.ms.newsalert/Configuration.java,v $
+ *  
+ */
+package net.nutch.searcher;
+
+/**
+ * created on ${date}
+ * 
+ * @author Stefan Groschupf
+ * @author $Author: sg $ (last edit)
+ * @version $Revision: 1.2 $
+ * 
+ * A data object that holds a set of HostHit objects.
+ */
+public class HostHits {
+
+
+    private HostHit[] fHostHits;
+
+    private long fTotalHits;
+
+    /**
+     * Constructor.
+     * 
+     * @param hostHits
+     *            unique host hits.
+     * @param totalHits
+     *            the total number of hits, reduced by the number of hits
+     *            that were grouped under an already seen host.
+     * 
+     *  
+     */
+    public HostHits(HostHit[] hostHits,  long totalHits) {
+        super();
+        fHostHits = hostHits;
+        fTotalHits = totalHits;
+    }
+
+    /**
+     * 
+     * @param i
+     * @return Returns a <code>HostHit</code>
+     */
+    public HostHit getHostHit(int i) {
+        return fHostHits[i];
+    }
+
+    /**
+     * 
+     * @param start
+     * @param length
+     * @return Returns a subset of the <code>HostHit</code> objects.
+     */
+    public HostHit[] getHostHits(int start, int length) {
+        HostHit[] results = new HostHit[length];
+        for (int i = 0; i < length; i++) {
+            results[i] = fHostHits[start + i];
+        }
+        return results;
+    }
+
+    /**
+     * 
+     * @return Returns the number of HostHits included in this list.
+     */
+    public int getLength() {
+        return fHostHits.length;
+    }
+
+    /**
+     * 
+     * @return Returns the total number of hits which matched the query.
+     */
+    public long getTotal() {
+        return fTotalHits;
+    }
+}

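A note on getHost in the patch above: it cuts the host out of the URL string with a fixed "http://" offset, so for an https URL the result would, as far as I can tell, keep a leading slash, and a port number stays part of the host. As a hedged alternative, the host could be taken from java.net.URI instead. HostOf and hostOf are hypothetical names, and falling back to the whole URL when no host can be parsed is my assumption:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; not part of the patch.
final class HostOf {

    /** Returns the lower-cased host of the URL, or the URL itself if none can be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null ? host.toLowerCase(Locale.ROOT) : url;
        } catch (URISyntaxException e) {
            // assumption: an unparsable URL simply forms its own group
            return url;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.nutch.org/docs/index.html")); // www.nutch.org
    }
}
```
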

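The search method in the patch fetches more raw hits than requested, because grouping shrinks the list. The initial number follows the formula numHits + (hitsPerPage * rawHitsFactor) from the patch; the small sketch below only works through that arithmetic (RawHitsEstimate is a hypothetical name):

```java
// Hypothetical helper that mirrors the formula used in the patch.
final class RawHitsEstimate {

    /** Number of raw hits fetched before host grouping is done. */
    static int initialRawHits(int numHits, int hitsPerPage, int rawHitsFactor) {
        return numHits + hitsPerPage * rawHitsFactor;
    }

    public static void main(String[] args) {
        // 20 requested hits, 10 hits per page, factor 2
        System.out.println(initialRawHits(20, 10, 2)); // 40
    }
}
```

A larger factor fetches more raw hits up front and should make a second search call less likely when many hits share a host.
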
You also need to add the following property to nutch-default.xml:

<property>
<name>search.page.raw.hits.factor</name>
<value>2</value>
<description>
A factor that is used to determine the number of raw hits initially fetched,
before host grouping is done.
</description>
</property>



[advertising]
Lastly, as you may have already guessed, here is the reminder that comes with all my contributions: please donate to something really helpful in this small world.
http://www.unicef.org/support/index.html
Please donate for children who die because they have too little to eat, and keep in mind every day that a small Linux server costs more than food for such a child for the next 4 years!


I hope someone will find my patch useful.
Comments and suggestions are welcome.

Best,
Stefan


---------------------------------------------------------------
enterprise information technology consulting
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org
