[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609639#comment-14609639 ]
Markus Jelsma edited comment on NUTCH-2038 at 7/1/15 6:50 AM:
--------------------------------------------------------------

I have tried to search the comments here, but can anyone explain why Lucene and Mahout are in the core ivy xml when this is just a plugin? Also, if this is just about having some analyzers in the plugin, we don't need the Lucene core.

was (Author: markus17):
I have tried to search the comments here, but can anyone explain why lucene and mahout are in the core ivy xml when this is jus a plugin?

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A html parse filter that will filter out the outlinks in two stages.
> Classify the parse text and decide if the parent page is relevant. If
> relevant then don't filter the outlinks. If irrelevant then go thru each
> outlink and see if the url contains any of the important words from a list.
> If it does then let it pass.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
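
For readers following the thread, the two-stage outlink filtering described in the issue summary can be sketched roughly as below. This is only an illustration and not the code from the NUTCH-2038 patch: the class name TwoStageOutlinkFilter, the filter method, and the use of a plain Predicate as a stand-in for the trained Naive Bayes classifier are all assumptions, and no Nutch, Lucene, or Mahout APIs are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage filtering from the issue description.
final class TwoStageOutlinkFilter {
    // Assumption: stands in for the trained classifier; true means "page is relevant".
    private final Predicate<String> isRelevant;
    // Assumption: the "important words" list mentioned in the issue.
    private final List<String> importantWords;

    TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
        this.isRelevant = isRelevant;
        this.importantWords = importantWords;
    }

    // Returns the outlinks that survive filtering for a page with the given parse text.
    List<String> filter(String parseText, List<String> outlinks) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: on an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lowerUrl = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break; // one matching word is enough to let the URL pass
                }
            }
        }
        return kept;
    }
}
```

Keeping the classifier behind a simple predicate like this would let the word-list stage be exercised without any classifier library on the classpath, which is related to the dependency question raised in the comment above.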