The plugin was registered, but it is not called when ParserJob runs.

2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:324 - Registered Plugins:
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - the nutch core extension points (nutch-extensionpoints)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Basic URL Normalizer (urlnormalizer-basic)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Html Parse Plug-in (parse-html)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Basic Indexing Filter (index-basic)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Http / Https Protocol Plug-in (protocol-httpclient)
#################### This is my plugin ###############################
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Locale extractor Filter (localeextractor)
################# End My plugin #####################################
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - HTTP Framework (lib-http)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Regex URL Filter (urlfilter-regex)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Pass-through URL Normalizer (urlnormalizer-pass)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Http Protocol Plug-in (protocol-http)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Regex URL Normalizer (urlnormalizer-regex)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Tika Parser Plug-in (parse-tika)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - OPIC Scoring Plug-in (scoring-opic)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - CyberNeko HTML Parser (lib-nekohtml)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Anchor Indexing Filter (index-anchor)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:330 - Regex URL Filter Framework (lib-regex-filter)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:334 - Registered Extension-Points:
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Parse Filter (org.apache.nutch.parse.ParseFilter)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch Content Parser (org.apache.nutch.parse.Parser)
2014-09-11 00:57:43 INFO org.apache.nutch.plugin.PluginRepository:339 - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2014-09-11 00:57:43 DEBUG org.apache.nutch.plugin.PluginRepository:105 - After initialization = org.apache.nutch.plugin.PluginRepository@a840c483

-----Original Message-----
From: Iqbal Shaikh [mailto:[email protected]]
Sent: Thursday, September 11, 2014 1:41 AM
To: [email protected]
Subject: RE: Seeking help about running nutch jobs

Hi Krishnanand,

Am new to Nutch like yourself but can you try to increase the logging level of plugin in log4j.properties from:

log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN

to:

log4j.logger.org.apache.nutch.plugin.PluginRepository=INFO

And see if your plugin is registered:

2014-09-11 09:36:40,493 INFO plugin.PluginRepository - Plugins: looking in: /home/../target/classes/plugins
2014-09-11 09:36:40,474 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2014-09-11 09:36:40,479 INFO plugin.PluginRepository - Registered Plugins:
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - ElasticIndexWriter (indexer-elastic)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Index Metadata (index-metadata)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Registered Extension-Points:
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2014-09-11 09:36:40,480 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2014-09-11 09:36:40,481 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)

Hope that helps.

Iqbal Shaikh

________________________________________
From: Krishnanand, Kartik [[email protected]]
Sent: 11 September 2014 06:28
To: [email protected]
Subject: Seeking help about running nutch jobs

Hi,

I am a nutch newbie and I would like to ask a few questions and I would appreciate any assistance.

1. I have written a plugin that extracts all the JavaScript URLs and returns them as outlinks. I would like to configure Nutch to take these outlinks and push these urls in the crawldb. Is there a way I can do that? If yes, I would like to know how I could do it.

2. How do I invoke this plugin? My logs show that the plugin is not invoked. My set up is as follows:

Any advice would be gratefully appreciated. I am referring to http://florianhartl.com/nutch-how-it-works.html on what each job does.

Thanks,
Kartik

<parse-plugins>
  <!-- by default if the mimeType is set to *, or if it can't be determined, use parse-tika -->
  . . . . .
  <mimeType name="text/html">
    <plugin id="parse-html" />
    <plugin id="localeextractor" />
  </mimeType>
  . . . .
  <aliases>
    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-ext" extension-id="ExtParser" />
    <alias name="parse-js" extension-id="JSParser" />
    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
    <alias name="parse-swf" extension-id="org.apache.nutch.parse.swf.SWFParser" />
    <alias name="parse-zip" extension-id="org.apache.nutch.parse.zip.ZipParser" />
    <!-- This is my addition -->
    <alias name="localeextractor" extension-id="LocaleExtractorFilter" />
  </aliases>
</parse-plugins>

########### My plugin code #############

public class LocaleExtractorFilter implements Parser {

  @Override
  public Parse getParse(String url, WebPage page) {
    // TODO Auto-generated method stub
    String stringContent = Bytes.toString(page.getContent());
    Set<Outlink> jsOutlinks = this.addUrlsToBeParsed(stringContent);
    return new Parse(
        page.getText().toString(),
        page.getTitle().toString(),
        jsOutlinks.toArray(new Outlink[0]),
        page.getParseStatus());
  }

  private static final Pattern PATTERN_WITH_ASCII_QUOTES =
      Pattern.compile("^(?:.*?goto\\('(\\w+)'\\).*|.*?OOLPopUp\\('(.+?'\\)).*)$", Pattern.MULTILINE);

  private static final String REDIRECT = "/accounts/redirect.go?target=";

  /**
   * The implementation parses the URLs from the string content of HTML files. The URLs are of the
   * following format:
   * <ul>
   * <li>{@code fsdgoto} links, Example
   * {@code <a name='bill_pay' href='javascript:goto('billpay');'>Bill Pay
   * </a>}
   * </li></ul>
   *
   * @param stringContent from which multiple urls can be constructed
   */
  Set<Outlink> addUrlsToBeParsed(String stringContent) {
    Set<Outlink> outlinks = new TreeSet<Outlink>();
    Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(stringContent);
    while (matcher.find()) {
      String url = "";
      try {
        // Group 1 is the goto(...) target; group 2 is the OOLPopUp(...) argument.
        url = new StringBuilder(REDIRECT).append(
            matcher.group(1) != null ? matcher.group(1) : matcher.group(2)).toString();
        outlinks.add(new Outlink(url, ""));
      } catch (MalformedURLException mue) {
        LOG.warn("Error generating outlink urls for " + url, mue);
      }
    }
    // The method is declared to return the collected outlinks.
    return outlinks;
  }
}

############# Plugin.xml #############

<plugin id="localeextractor" name="Locale extractor Filter" version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="localeextractor">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="com.bofa.ecom.search.localeextractor" name="LocaleExtractor" point="org.apache.nutch.parse.Parser">
    <implementation id="LocaleExtractorFilter" class="com.myproject.LocaleExtractorFilter" />
  </extension>
</plugin>

----------------------------------------------------------------------
This message, and any attachments, is for the intended recipient(s) only, may contain information that is privileged, confidential and/or proprietary and subject to important terms and conditions available at http://www.bankofamerica.com/emaildisclaimer. If you are not the intended recipient, please delete this message.

Transform is a trading division of Engine Partners UK LLP, a limited liability partnership registered in England & Wales with registered number OC365812. Our registered office is at 60 Great Portland Street, London W1W 7RT, United Kingdom. A list of our members is open for inspection at our registered office.
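
For checking the link extraction outside a crawl, here is a minimal sketch that needs no Nutch classes. It is not the posted plugin: the class name JsLinkExtractorSketch, the method extractTargets and the sample HTML are all made up for illustration. The pattern is also a simplified assumption. It is not anchored to whole lines, so it can find several calls on one line, and its second group stops before the closing quote and parenthesis, which may or may not be what the posted pattern intends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper; plain strings stand in for Nutch Outlink objects.
class JsLinkExtractorSketch {

    // Simplified assumption: group 1 is the goto('...') target,
    // group 2 is the OOLPopUp('...') argument without the closing quote.
    private static final Pattern JS_CALL =
        Pattern.compile("goto\\('(\\w+)'\\)|OOLPopUp\\('(.+?)'\\)");

    // Same prefix as in the posted plugin code.
    private static final String REDIRECT = "/accounts/redirect.go?target=";

    // Returns one redirect URL per JavaScript call found, in document order.
    static List<String> extractTargets(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = JS_CALL.matcher(html);
        while (matcher.find()) {
            String target = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
            urls.add(REDIRECT + target);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample markup for illustration only.
        String html =
            "<a name='bill_pay' href=\"javascript:goto('billpay');\">Bill Pay</a>\n"
            + "<a href=\"javascript:OOLPopUp('help/faq');\">FAQ</a>";
        for (String url : extractTargets(html)) {
            System.out.println(url);
        }
    }
}
```

Running main on the sample markup should print one redirect URL for the goto call and one for the OOLPopUp call. Keeping the regex in a helper like this makes it possible to test the extraction on saved pages before wiring it into a plugin.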

