[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606348#comment-14606348 ]
ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user asitang closed the pull request at:

    https://github.com/apache/nutch/pull/41

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. First, classify
> the parse text to decide whether the parent page is relevant. If it is
> relevant, keep all of its outlinks. If it is irrelevant, go through each
> outlink and check whether its URL contains any of the important words from
> a list; if it does, let that outlink pass.
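
The two-stage filtering described in the issue can be sketched roughly as follows. This is only an illustration in Python, not the code from the pull request or the Nutch plugin itself (Nutch is written in Java). The names TinyNaiveBayes, filter_outlinks, the "relevant"/"irrelevant" labels, the training sentences and the example URLs are all hypothetical, and treating "contains" as a case-insensitive substring match is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every token seen during training

    def train(self, examples):
        for text, label in examples:
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def classify(self, text):
        """Return the label with the highest log-posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_outlinks(model, parse_text, outlinks, important_words):
    # Stage 1: a relevant parent page keeps all of its outlinks.
    if model.classify(parse_text) == "relevant":
        return list(outlinks)
    # Stage 2: on an irrelevant page, keep only outlinks whose URL
    # contains at least one important word (substring match, assumed).
    words = [w.lower() for w in important_words]
    return [url for url in outlinks if any(w in url.lower() for w in words)]


# Hypothetical training data and page, for demonstration only.
model = TinyNaiveBayes()
model.train([
    ("solar panel inverter installation guide", "relevant"),
    ("solar battery storage pricing", "relevant"),
    ("celebrity gossip and fashion news", "irrelevant"),
    ("football scores and transfer rumours", "irrelevant"),
])
outlinks = ["http://example.com/solar-deals", "http://example.com/gossip"]
kept = filter_outlinks(model, "latest celebrity fashion gossip", outlinks, ["solar"])
print(kept)
```

With this toy training set the page text is classified as irrelevant, so only the outlink whose URL contains "solar" is kept. A page classified as relevant would pass both outlinks through unchanged.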