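I've written a small patch for org.apache.nutch.crawl.Injector that lets
a plugin author force an injected URL to overwrite any existing entry for
the same URL. The override is requested by storing a BooleanWritable under
the Injector.OVERWRITE_INJECT metadata key of the injected CrawlDatum.
I've not submitted anything to JIRA before. Is this worth submitting and,
if so, how should I go about it?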
Index: src/java/org/apache/nutch/crawl/Injector.java
===================================================================
--- src/java/org/apache/nutch/crawl/Injector.java (release version)
+++ src/java/org/apache/nutch/crawl/Injector.java (patched version)
@@ -41,8 +41,8 @@
* crawled. Useful for bootstrapping the system. */
public class Injector extends ToolBase {
public static final Log LOG = LogFactory.getLog(Injector.class);
+ public static final Text OVERWRITE_INJECT = new Text("nutch.crawl.overrideInject");
-
/** Normalize and filter injected urls. */
public static class InjectMapper implements Mapper {
private URLNormalizers urlNormalizers;
@@ -116,9 +116,17 @@
old = val;
}
}
+
+ boolean isOverwrite = false;
+ if(injected!=null)
+ if(injected.getMetaData().containsKey(Injector.OVERWRITE_INJECT))
+ isOverwrite = ((BooleanWritable)injected.getMetaData().get(Injector.OVERWRITE_INJECT)).get();
+
CrawlDatum res = null;
- if (old != null) res = old; // don't overwrite existing value
- else res = injected;
+ if ( old != null && !isOverwrite )
+ res = old; // don't overwrite existing value
+ else
+ res = injected;
output.collect(key, res);
}
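
To make the intended behaviour easier to check outside of Nutch, here is a minimal standalone sketch of the same decision. It is not the Nutch code: the Datum class is a hypothetical stand-in for CrawlDatum, a plain String and Boolean stand in for Text and BooleanWritable, and the instanceof guard is an addition of the sketch rather than part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class InjectOverwriteSketch {
    // Same key string as in the patch; a String stands in for Hadoop's Text.
    static final String OVERWRITE_INJECT = "nutch.crawl.overrideInject";

    /** Hypothetical stand-in for CrawlDatum: a label plus a metadata map. */
    static final class Datum {
        final String label;
        final Map<String, Object> metaData = new HashMap<>();

        Datum(String label) {
            this.label = label;
        }
    }

    /** Mirrors the patched logic: keep the old entry unless the injected one asks to overwrite. */
    static Datum choose(Datum old, Datum injected) {
        boolean isOverwrite = false;
        if (injected != null) {
            Object flag = injected.metaData.get(OVERWRITE_INJECT);
            // Guard added in this sketch: ignore the flag if it is missing or not a Boolean.
            isOverwrite = flag instanceof Boolean && (Boolean) flag;
        }
        return (old != null && !isOverwrite) ? old : injected;
    }

    public static void main(String[] args) {
        Datum old = new Datum("existing");
        Datum plain = new Datum("injected");
        Datum forced = new Datum("injected-forced");
        forced.metaData.put(OVERWRITE_INJECT, Boolean.TRUE);

        System.out.println(choose(old, plain).label);   // existing entry is kept
        System.out.println(choose(old, forced).label);  // flag set, injected entry wins
        System.out.println(choose(null, plain).label);  // nothing to keep, injected entry is used
    }
}
```

Without the flag the behaviour is the same as before the patch, so existing injectors should be unaffected.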
Cheers
Rob