[Nutch-dev] Patch to reduce whitespace in Summary

Andrzej Bialecki Sat, 15 May 2004 14:40:33 -0700

Hello,

Attached is a patch that reduces unnecessary whitespace in highlighted summaries. It contains two implementations: a commented-out JDK 1.4 version (it requires java.util.regex), and a loop over the input String that is probably less efficient.
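
For comparison, here is a minimal standalone sketch of the two approaches side by side. The class and method names (WhitespaceCollapseSketch, collapseRegex, collapseLoop) are made up for illustration and are not part of Nutch, and the code uses StringBuilder, so it needs a newer JDK than the one the patch targets. The loop here deliberately emits a single space for each whitespace run so that it agrees with the regex version. The loop in the patch instead keeps the first whitespace character of a run unchanged, so a tab followed by spaces stays a tab.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the Nutch code base.
class WhitespaceCollapseSketch {
  private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

  /** Regex variant: trims, then replaces every whitespace run with one space. */
  static String collapseRegex(String text) {
    if (text == null) return null;
    return WHITESPACE_RUN.matcher(text.trim()).replaceAll(" ");
  }

  /** Loop variant: trims, then emits one space per run of whitespace. */
  static String collapseLoop(String text) {
    if (text == null) return null;
    text = text.trim();
    StringBuilder out = new StringBuilder(text.length());
    boolean previousWasWhitespace = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean isWhitespace = Character.isWhitespace(c);
      if (!isWhitespace) {
        out.append(c);
      } else if (!previousWasWhitespace) {
        out.append(' ');  // first character of a run becomes a plain space
      }
      previousWasWhitespace = isWhitespace;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String input = "  foo \t\n bar   baz ";
    System.out.println("[" + collapseRegex(input) + "]");
    System.out.println("[" + collapseLoop(input) + "]");
  }
}
```

The two variants agree on ASCII input such as the string in main(). They can differ on some Unicode space characters, because Character.isWhitespace() recognizes more characters than the default \s class does.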

This patch also contains modifications to support custom highlight markup. This part is not complete: the corresponding changes in Summarizer/DistributedSearch/NutchBean are still missing, and we need to discuss where the markup setting should live. If we put it into the protocol for getSummary(), we take an unnecessary hit on every request. If we put it into config defaults initialized in Summary, we lose configurability: the client should be free to decide what markup to use, but in that case it can only be configured per installation, not per request. Another option, which as I understand it was considered and rejected, would be to return Fragment[] from getSummary(). That would fit nicely, because the client could supply its own markup when converting the Fragments. Any other ideas?
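
To make the Fragment[] option concrete, here is a rough sketch of what client-side conversion could look like. Everything in it is an assumption for illustration: the Fragment class below is a simplified stand-in rather than the real Summary.Fragment, and render() and sample() are hypothetical helpers. It does no HTML escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Nutch API: the client receives plain
// fragments and applies whatever highlight markup it wants per request.
class FragmentMarkupSketch {

  /** Simplified stand-in for a summary fragment. */
  static class Fragment {
    final String text;
    final boolean highlight;

    Fragment(String text, boolean highlight) {
      this.text = text;
      this.highlight = highlight;
    }
  }

  /** Joins fragments with single spaces, wrapping highlighted ones in the given markup. */
  static String render(List<Fragment> fragments, String markStart, String markEnd) {
    StringBuilder buffer = new StringBuilder();
    for (int i = 0; i < fragments.size(); i++) {
      if (i > 0) buffer.append(' ');
      Fragment f = fragments.get(i);
      if (f.highlight) {
        buffer.append(markStart).append(f.text).append(markEnd);
      } else {
        buffer.append(f.text);
      }
    }
    return buffer.toString();
  }

  /** Builds a small made-up summary for the demo. */
  static List<Fragment> sample() {
    List<Fragment> fragments = new ArrayList<Fragment>();
    fragments.add(new Fragment("apache", false));
    fragments.add(new Fragment("nutch", true));
    fragments.add(new Fragment("search", false));
    return fragments;
  }

  public static void main(String[] args) {
    // The same fragments rendered with two different client-chosen markups.
    System.out.println(render(sample(), "<b>", "</b>"));
    System.out.println(render(sample(), "<em class=\"hit\">", "</em>"));
  }
}
```

With this shape the markup never travels over the getSummary() protocol and does not have to be fixed per installation. The cost is that clients have to handle fragments themselves instead of receiving a ready-made String.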

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

Index: src/java/net/nutch/searcher/NutchBean.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/searcher/NutchBean.java,v
retrieving revision 1.8
diff -b -d -u -r1.8 NutchBean.java
--- src/java/net/nutch/searcher/NutchBean.java  4 May 2004 17:20:22 -0000       1.8
+++ src/java/net/nutch/searcher/NutchBean.java  15 May 2004 20:49:06 -0000
@@ -67,7 +67,7 @@
     IndexSearcher indexSearcher;
     if (indexDir.exists()) {
       LOG.info("opening merged index in " + indexDir.getCanonicalPath());
-      indexSearcher = new IndexSearcher(indexDir.toString());
+      indexSearcher = new IndexSearcher(indexDir.getCanonicalPath());
     } else {
       LOG.info("opening segment indexes in " + segmentsDir.getCanonicalPath());
       
Index: src/java/net/nutch/searcher/Summary.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/searcher/Summary.java,v
retrieving revision 1.2
diff -b -d -u -r1.2 Summary.java
--- src/java/net/nutch/searcher/Summary.java    12 Feb 2003 21:26:18 -0000      1.2
+++ src/java/net/nutch/searcher/Summary.java    15 May 2004 20:49:06 -0000
@@ -11,10 +11,32 @@
 
   /** A fragment of text within a summary. */
   public static class Fragment {
-    private String text;
+    public static final String DEFAULT_MARK_START = "<b>";
+    public static final String DEFAULT_MARK_END   = "</b>";
+    protected String markStart;
+    protected String markEnd;
+    protected String text;
 
     /** Constructs a fragment for the given text. */
-    public Fragment(String text) { this.text = text; }
+    public Fragment(String text) {
+        if (text != null) {
+          text = text.trim();
+          /* JDK 1.4 version instead of the loop below:
+             text = text.replaceAll("\\s+", " ");
+          */
+          StringBuffer tt = new StringBuffer();
+          int len = text.length();
+          char c = '\0';
+          for (int i = 0; i < len; i++) {
+             if (Character.isWhitespace(text.charAt(i))
+                    && Character.isWhitespace(c)) continue;
+             c = text.charAt(i);
+             tt.append(c);
+          }
+          text = tt.toString();
+        }
+       this.text = text;
+    }
 
     /** Returns the text of this fragment. */
     public String getText() { return text; }
@@ -31,26 +53,43 @@
 
   /** A highlighted fragment of text within a summary. */
   public static class Highlight extends Fragment {
+       
     /** Constructs a highlighted fragment for the given text. */
-    public Highlight(String text) { super(text); }
+    public Highlight(String text) {
+       super(text);
+       markStart = DEFAULT_MARK_START;
+       markEnd = DEFAULT_MARK_END;
+    }
+    
+    public Highlight(String text, String markStart, String markEnd) {
+       super(text);
+       this.markStart = markStart;
+       this.markEnd = markEnd;
+    }
 
     /** Returns true. */
     public boolean isHighlight() { return true; }
 
     /** Returns an HTML representation of this fragment. */
-    public String toString() { return "<b>" + super.toString() + "</b>"; }
+    public String toString() { return markStart + super.toString() + markEnd; }
   }
 
   /** An ellipsis fragment within a summary. */
   public static class Ellipsis extends Fragment {
     /** Constructs an ellipsis fragment for the given text. */
-    public Ellipsis() { super(" ... "); }
+    public Ellipsis(String markStart, String markEnd) {
+        super(markStart + " ... " + markEnd);
+    }
+
+    public Ellipsis() {
+        this(DEFAULT_MARK_START, DEFAULT_MARK_END);
+    }
 
     /** Returns true. */
     public boolean isEllipsis() { return true; }
 
     /** Returns an HTML representation of this fragment. */
-    public String toString() { return "<b> ... </b>"; }
+    public String toString() { return text; }
   }
 
   private ArrayList fragments = new ArrayList();
@@ -72,6 +111,7 @@
   public String toString() {
     StringBuffer buffer = new StringBuffer();
     for (int i = 0; i < fragments.size(); i++) {
+      if (i > 0) buffer.append(' ');
       buffer.append(fragments.get(i));
     }
     return buffer.toString();
