[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reopened NUTCH-2038:
------------------------------------

# unit test TestParserFactory fails:
{noformat}
Testcase: testGetParsers took 0.892 sec
        Caused an ERROR
null
java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:94)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:136)
{noformat}
# there are still unresolved library dependencies
{noformat}
% bin/nutch parsechecker .../test_naivebayes.html
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/Option
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:93)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:148)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
        at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
        at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:34)
{noformat}
All dependencies must be listed in plugin.xml, including inter-library dependencies (all jars in build/plugins/parsefilter-naivebayes/).

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A html parse filter that will filter out the outlinks in two stages.
> Classify the parse text and decide if the parent page is relevant. If
> relevant then don't filter the outlinks. If irrelevant then go thru each
> outlink and see if the url contains any of the important words from a list.
> If it does then let it pass.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
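
For readers new to the issue, the two-stage outlink filtering described in the quoted summary can be sketched roughly as follows. This is an illustration only, not the NaiveBayesParseFilter code from the patch: the class name TwoStageOutlinkFilter, the Predicate standing in for the Naive Bayes classifier, and the case-insensitive substring match are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-stage outlink filter; not the plugin's actual code. */
final class TwoStageOutlinkFilter {

    // Stage 1: stand-in for the trained classifier (assumption: true means "relevant").
    private final Predicate<String> isRelevant;
    // Stage 2: list of important words, stored lower-cased (assumption: case-insensitive match).
    private final List<String> importantWords = new ArrayList<>();

    TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
        this.isRelevant = isRelevant;
        for (String word : importantWords) {
            this.importantWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the outlinks that survive filtering for a page with the given parse text. */
    List<String> filter(String parseText, List<String> outlinks) {
        // Relevant parent page: keep every outlink.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Irrelevant parent page: keep only outlinks whose URL contains an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

Passing the classifier in as a Predicate keeps the sketch independent of any model-training library, which is the part the stack traces above show failing.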