Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do not have a lot of experience with SVN so correct me if I am wrong.
The code looks surprisingly simple to me, so maybe I have overlooked something. Today I was able to index two segments, one containing 120,000 pages and the other about 3 million. I had to change the values of maxMergeDoc, mergeFactor and minMergeDocs, which were set quite high in my installation, because I was getting "too many open files" exceptions. That was to be expected: I have added a new field, so the number of files open during indexing increased. I think it should work without problems with the default values of these parameters. I ran some basic searches against the same segment built with the new and the old code, and I can see the differences I expected. This week and next I will look at how the change affects relevance and try to choose good parameters for it.
As Doug and Andrzej suggested, I prepared a basic patch. Once this patch is accepted we can think about extensions. Currently on my list I have:
- making boost values configurable from config file
- NutchDocumentAnalyzer - I was wondering why it is done this way, but decided to keep the current behavior for url (and so for host and title). As Andrzej suggests, it should probably be changed. I will look at the details later, as I would rather change one thing at a time.
- TITLE_BOOST - I personally think it should be lower than ANCHOR_BOOST, but once we make the boost values configurable that will end the discussion about it, and Andrzej will be able to set it high enough for his purposes.
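To make the first item on the list more concrete, here is a minimal sketch of how the boosts could be read from configuration. It is not part of the patch. It uses plain java.util.Properties rather than any Nutch class, and the class name FieldBoosts and the "query.<field>.boost" key names are hypothetical; only the field order and default values are taken from the patch.

```java
import java.util.Properties;

/** Sketch: per-field query boosts read from properties, with the patch's values as defaults. */
public class FieldBoosts {
  // Field order and defaults mirror FIELDS / FIELD_BOOSTS in the patch.
  static final String[] FIELDS = { "url", "anchor", "content", "title", "host" };
  static final float[] DEFAULTS = { 4.0f, 2.0f, 1.0f, 1.5f, 2.0f };

  /** Property key for a field. The naming scheme is an assumption, not an existing setting. */
  static String key(String field) {
    return "query." + field + ".boost";
  }

  /** Returns one boost per field, falling back to the default when a key is absent. */
  static float[] load(Properties props) {
    float[] boosts = new float[FIELDS.length];
    for (int i = 0; i < FIELDS.length; i++) {
      String value = props.getProperty(key(FIELDS[i]));
      boosts[i] = (value == null) ? DEFAULTS[i] : Float.parseFloat(value.trim());
    }
    return boosts;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("query.title.boost", "3.0"); // override only the title boost
    float[] boosts = load(props);
    for (int i = 0; i < FIELDS.length; i++) {
      System.out.println(FIELDS[i] + " = " + boosts[i]);
    }
  }
}
```

With something like this, anyone who wants a higher title boost only has to override one key and the other fields keep their defaults.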
Once the patch is accepted and I have done some testing on real data, I would like to think about adding the feature suggested by Andrzej to differentiate between:
1. "http://www.ikea.se/some/other/name.html"
3. "http://ikea.some.se/some/other/name.html"
and also about my earlier suggestions. Waiting for comments, Piotr
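For reference, a small sketch of what ends up in the new "host" field for URLs like the ones above. It uses the same java.net.URL.getHost() call as the patch; the class name HostDemo and the helper hostOf are only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Sketch: extract the host the same way the patched BasicIndexingFilter does. */
public class HostDemo {
  /** Returns the host part of the URL, or null if the URL is malformed. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost();
    } catch (MalformedURLException e) {
      return null; // the patch skips the host field in this case
    }
  }

  public static void main(String[] args) {
    System.out.println(hostOf("http://www.ikea.se/some/other/name.html"));  // www.ikea.se
    System.out.println(hostOf("http://ikea.some.se/some/other/name.html")); // ikea.some.se
  }
}
```

Both hosts contain "ikea", so the host field as it stands in this patch does not by itself tell the two cases apart; that is the part left for later.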
Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (revision 158818)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (working copy)
@@ -77,8 +77,9 @@
/** Returns a new token stream for text from the named field. */
public TokenStream tokenStream(String fieldName, Reader reader) {
Analyzer analyzer;
- if ("url".equals(fieldName) || ("anchor".equals(fieldName)))
- analyzer = ANCHOR_ANALYZER;
+ if ("url".equals(fieldName) || ("anchor".equals(fieldName))
+ || ("host".equals(fieldName)) || ("title".equals(fieldName)))
+ analyzer = ANCHOR_ANALYZER;
else
analyzer = CONTENT_ANALYZER;
Index: src/java/org/apache/nutch/indexer/NutchSimilarity.java
===================================================================
--- src/java/org/apache/nutch/indexer/NutchSimilarity.java (revision 158818)
+++ src/java/org/apache/nutch/indexer/NutchSimilarity.java (working copy)
@@ -24,10 +24,10 @@
/** Normalize field by length. Called at index time. */
public float lengthNorm(String fieldName, int numTokens) {
- if ("url".equals(fieldName)) { // URL: prefer short
+ if ("url".equals(fieldName)||"host".equals(fieldName)) { // URL: prefer short
return 1.0f / numTokens; // use linear normalization
- } else if ("anchor".equals(fieldName)) { // Anchor: prefer more
+ } else if ("anchor".equals(fieldName)||"title".equals(fieldName)) { // Anchor: prefer more
return (float)(1.0/Math.log(Math.E+numTokens)); // use log
} else if ("content".equals(fieldName)) { // Content: penalize short
Index: src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java
===================================================================
--- src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java (revision 158818)
+++ src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java (working copy)
@@ -36,15 +36,22 @@
private static float URL_BOOST = 4.0f;
private static float ANCHOR_BOOST = 2.0f;
+ private static float TITLE_BOOST = 1.5f;
+ private static float HOST_BOOST = 2.0f;
private static int SLOP = Integer.MAX_VALUE;
private static float PHRASE_BOOST = 1.0f;
- private static final String[] FIELDS = {"url", "anchor", "content"};
- private static final float[] FIELD_BOOSTS = {URL_BOOST, ANCHOR_BOOST, 1.0f};
+ private static final String[] FIELDS = { "url", "anchor", "content",
+ "title", "host" };
- /** Set the boost factor for url matches, relative to content and anchor
- * matches */
+ private static final float[] FIELD_BOOSTS = { URL_BOOST, ANCHOR_BOOST,
+ 1.0f, TITLE_BOOST, HOST_BOOST };
+
+ /**
+ * Set the boost factor for url matches, relative to content and anchor
+ * matches
+ */
public static void setUrlBoost(float boost) { URL_BOOST = boost; }
/** Set the boost factor for title/anchor matches, relative to url and
Index: src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
===================================================================
--- src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java (revision 158818)
+++ src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java (working copy)
@@ -27,6 +27,8 @@
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.pagedb.FetchListEntry;
+import java.net.MalformedURLException;
+import java.net.URL;
import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.util.NutchConf;
@@ -43,6 +45,15 @@
throws IndexingException {
String url = fo.getUrl().toString();
+ String hostname = null;
+ try {
+ URL u = new URL(url);
+ hostname = u.getHost();
+ } catch (MalformedURLException e) {
+ //ignore hostname if url is malformed
+ }
+ if (hostname!=null)
+ doc.add(Field.UnStored("host", hostname));
// url is both stored and indexed, so it's both searchable and returned
doc.add(Field.Text("url", url));
@@ -61,10 +72,8 @@
if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
title = title.substring(0, MAX_TITLE_LENGTH);
}
- // add title as anchor so it is searchable. doesn't warrant its own field.
- doc.add(Field.UnStored("anchor", title));
- // add title unindexed, so that it can be displayed
- doc.add(Field.UnIndexed("title", title));
+ // add title indexed and stored so that it can be displayed
+ doc.add(Field.Text("title", title));
return doc;
}
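
As a side note on the NutchSimilarity change, here is a small standalone sketch comparing the two normalizations that "host" and "title" now share with "url" and "anchor". The formulas are copied from lengthNorm in the patch; the class name LengthNormDemo and the method names are only for illustration.

```java
/** Sketch: compare the linear (url/host) and logarithmic (anchor/title) length norms. */
public class LengthNormDemo {
  /** Linear normalization, used for "url" and now "host": strongly prefers short fields. */
  static float linearNorm(int numTokens) {
    return 1.0f / numTokens;
  }

  /** Logarithmic normalization, used for "anchor" and now "title": penalizes length mildly. */
  static float logNorm(int numTokens) {
    return (float) (1.0 / Math.log(Math.E + numTokens));
  }

  public static void main(String[] args) {
    int[] lengths = { 1, 2, 5, 10 };
    for (int n : lengths) {
      System.out.println(n + " tokens: linear=" + linearNorm(n) + " log=" + logNorm(n));
    }
  }
}
```

The linear norm falls off much faster with the number of tokens, so a long host name is penalized far more than a long title.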
