[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605120#comment-14605120 ]
Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

Tests fail in TestParserFactory:

{noformat}
org/apache/commons/cli2/Option
java.lang.NoClassDefFoundError: org/apache/commons/cli2/Option
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:93)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:148)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
    at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:34)
    at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:244)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
    at org.apache.nutch.parse.TestParserFactory.testGetParsers(TestParserFactory.java:63)
{noformat}

I'll fix it.

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters the outlinks in two stages.
> First, classify the parse text and decide whether the parent page is
> relevant. If it is relevant, don't filter the outlinks. If it is
> irrelevant, go through each outlink and check whether the URL contains
> any of the important words from a list. If it does, let it pass.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
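
For readers skimming the issue summary, the two-stage filtering it describes might look roughly like the sketch below. This is not the NaiveBayesParseFilter code from the patch and it does not use any Nutch API: the class name TwoStageOutlinkFilter, the Predicate standing in for the trained classifier, and the case-insensitive substring match on URLs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filter described in the issue.
// Nothing here is taken from the actual Nutch plugin.
final class TwoStageOutlinkFilter {

    // Stand-in for the trained classifier: true means "parent page is relevant".
    private final Predicate<String> isRelevant;

    // Important words, stored lower-cased for case-insensitive matching (assumption).
    private final List<String> importantWords = new ArrayList<>();

    TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
        this.isRelevant = isRelevant;
        for (String word : importantWords) {
            this.importantWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the outlinks that survive filtering for a page with the given parse text.
    List<String> filter(String parseText, List<String> outlinks) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: on an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lowerUrl = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lowerUrl.contains(word)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

With a predicate such as text -> text.contains("crawler") and the word list ["nutch"], a page whose text mentions "crawler" would keep every outlink, while any other page would keep only outlinks whose URL contains "nutch".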