[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600419#comment-14600419 ]
Asitang Mishra commented on NUTCH-2038:
---------------------------------------

"maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482"

Will do that.

"is this statement still true? CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier."

Yes. The very first call to the parse filter takes longer because the model is trained and created during that call, so the timeout should be set somewhat higher. Subsequent calls do not take much time.

"afaics, the way the model is generated, stored and loaded needs a review: it should be read/generated once and then cached in memory, writing the model to disk is likely to become painful in distributed mode with concurrent tasks."

The model is created while parsing the first fetched page of the very first parse job. After that, the filter checks whether the model file is already present. At the moment the model file is read each time the classify() function is called; I will change that so the model is loaded once and kept in memory for the whole of a single parse job.

cosmetics:
"exceptions are properly logged via LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in stdout/stderr as of e.printStackTrace()
code formatting, see [1]"

Will do that.

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages.
> First, classify the parse text and decide whether the parent page is relevant.
> If it is relevant, do not filter the outlinks. If it is irrelevant, go through
> each outlink and check whether the URL contains any of the important words
> from a list. If it does, let it pass.
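
For illustration, the two-stage filtering described in the issue, together with the load-once model caching discussed in the comment, could be sketched roughly as below. This is only a hypothetical sketch and not the plugin's code: the class name TwoStageOutlinkFilter, the trainer supplier, and the use of a Predicate<String> as a stand-in for the trained Naive Bayes model are all assumptions made for the example, and no Nutch APIs are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch of the two-stage outlink filter; not the NUTCH-2038 code. */
class TwoStageOutlinkFilter {

  // Assumption: training (or loading) the model is expensive, so it is
  // wrapped in a supplier that should run at most once per filter instance.
  private final Supplier<Predicate<String>> trainer;
  private final Set<String> importantWords = new HashSet<>();

  // Cached model; null until the first page is classified.
  private volatile Predicate<String> model;

  TwoStageOutlinkFilter(Supplier<Predicate<String>> trainer, List<String> words) {
    this.trainer = trainer;
    for (String w : words) {
      importantWords.add(w.toLowerCase(Locale.ROOT));
    }
  }

  /** Returns the cached model, building it on first use (double-checked locking). */
  private Predicate<String> model() {
    Predicate<String> m = model;
    if (m == null) {
      synchronized (this) {
        m = model;
        if (m == null) {
          m = trainer.get();
          model = m;
        }
      }
    }
    return m;
  }

  /** Stage 1: classify the parent page. Stage 2: keyword check on each outlink URL. */
  List<String> filter(String parseText, List<String> outlinks) {
    if (model().test(parseText)) {
      return new ArrayList<>(outlinks); // relevant page: keep every outlink
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String lower = url.toLowerCase(Locale.ROOT);
      for (String word : importantWords) {
        if (lower.contains(word)) {
          kept.add(url); // irrelevant page, but the URL mentions an important word
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Toy "model": a page is relevant if its text mentions "nutch".
    TwoStageOutlinkFilter f = new TwoStageOutlinkFilter(
        () -> text -> text.contains("nutch"),
        Arrays.asList("crawl"));
    List<String> links =
        Arrays.asList("http://a.example/crawl-tips", "http://a.example/cats");
    System.out.println(f.filter("a page about nutch", links));
    System.out.println(f.filter("a page about cats", links));
  }
}
```

With the model held in a field, only the first call to filter() pays the training cost, which matches the observation that only the first parse needs a longer parser.timeout. Whether a real plugin should cache per instance or per JVM is a design choice left open here.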
--
This message was sent by Atlassian JIRA (v6.3.4#6332)