Hi,
Below is a patch to IndexingFilters.java that avoids running duplicate filters. Duplicates can occur if you inadvertently put multiple copies of plugins on the plugin.folders path list.
Plugins currently don't follow a contract to add fields only once, so if a filter runs more than once you end up with Documents containing multiple fields with the same names and values, and this may badly affect the search results.
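To illustrate the idea outside of Nutch, here is a minimal standalone sketch of de-duplicating filter instances by class name. Everything in it is hypothetical: the Filter interface, TitleFilter, AnchorFilter and dedupeByClassName are stand-ins made up for the example, not the Nutch or Lucene API, and it uses newer Java syntax (generics, putIfAbsent) than the patch does. It also swaps in a LinkedHashMap so the surviving filters keep their load order; the patch itself uses a plain HashMap, whose values() order is unspecified.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterDedupSketch {

    /** Hypothetical stand-in for an indexing filter; NOT the Nutch interface. */
    interface Filter {
        List<String> apply(List<String> fields);
    }

    /** Adds a "title" field on every run, i.e. no add-only-once contract. */
    static class TitleFilter implements Filter {
        public List<String> apply(List<String> fields) {
            fields.add("title");
            return fields;
        }
    }

    /** Adds an "anchor" field on every run. */
    static class AnchorFilter implements Filter {
        public List<String> apply(List<String> fields) {
            fields.add("anchor");
            return fields;
        }
    }

    /** Keeps the first instance of each filter class, in load order. */
    static Filter[] dedupeByClassName(Filter[] loaded) {
        Map<String, Filter> byName = new LinkedHashMap<>();
        for (Filter f : loaded) {
            // Later instances of an already-seen class are ignored.
            byName.putIfAbsent(f.getClass().getName(), f);
        }
        return byName.values().toArray(new Filter[0]);
    }

    /** Runs every filter in turn over an initially empty field list. */
    static List<String> runAll(Filter[] filters) {
        List<String> fields = new ArrayList<>();
        for (Filter f : filters) {
            fields = f.apply(fields);
        }
        return fields;
    }

    public static void main(String[] args) {
        // TitleFilter appears twice, as if its plugin were on the path twice.
        Filter[] loaded = { new TitleFilter(), new AnchorFilter(), new TitleFilter() };
        System.out.println(runAll(loaded));                    // [title, anchor, title]
        System.out.println(runAll(dedupeByClassName(loaded))); // [title, anchor]
    }
}
```

Keying on the class name, as the patch does, means two distinct plugins that happen to ship the same filter class are also collapsed into one, which is the intended effect here.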
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

Index: IndexingFilters.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/indexer/IndexingFilters.java,v
retrieving revision 1.1
diff -b -d -u -r1.1 IndexingFilters.java
--- IndexingFilters.java 28 Jun 2004 21:26:35 -0000 1.1
+++ IndexingFilters.java 7 Jul 2004 16:13:39 -0000
@@ -3,6 +3,8 @@
package net.nutch.indexer;
+import java.util.HashMap;
+
import org.apache.lucene.document.Document;
import net.nutch.plugin.*;
@@ -20,11 +22,15 @@
if (point == null)
throw new RuntimeException(IndexingFilter.X_POINT_ID+" not found.");
Extension[] extensions = point.getExtentens();
- CACHE = new IndexingFilter[extensions.length];
+ HashMap filterMap = new HashMap();
for (int i = 0; i < extensions.length; i++) {
Extension extension = extensions[i];
- CACHE[i] = (IndexingFilter)extension.getExtensionInstance();
+ IndexingFilter filter = (IndexingFilter)extension.getExtensionInstance();
+ if (!filterMap.containsKey(filter.getClass().getName())) {
+ filterMap.put(filter.getClass().getName(), filter);
+ }
}
+ CACHE = (IndexingFilter[])filterMap.values().toArray(new IndexingFilter[0]);
} catch (PluginRuntimeException e) {
throw new RuntimeException(e);
}
@@ -36,7 +42,7 @@
public static Document filter(Document doc, Parse parse, FetcherOutput fo)
throws IndexingException {
- for (int i = 0 ; i < CACHE.length; i++) {
+ for (int i = 0; i < CACHE.length; i++) {
doc = CACHE[i].filter(doc, parse, fo);
}
