[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593965#comment-14593965 ]
ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/34#discussion_r32869372

    --- Diff: conf/nutch-default.xml ---
    @@ -1259,6 +1259,34 @@
     <!-- urlfilter plugin properties -->
     <property>
    + <name>urlfilter.model.trainfile</name>
    + <value></value>
    + <description>Set the name of the file to be used for Naive Bayes training. The format will be:
    +Each line contains two tab seperted parts
    +There are two columns/parts:
    +1. "1" or "0", "1" for relevent and "0" for irrelevent document.
    +3. Text (text that will be used for training)
    +
    +Each row will be considered a new "document" for the classifier.
    +
    + </description>
    +</property>
    +
    +<property>
    + <name>urlfilter.model.wordlist</name>
    + <value></value>
    + <description>Put the name of the file you want to be used as a list of "hot words" to be matched in the url for the model filter. The format should be one word per line.
    + </description>
    +</property>
    +
    +<property>
    + <name>urlfilter.model.filter</name>
    + <value>false</value>
    + <description>A boolean. Set it to true if using the model filter.
    --- End diff --

    What does it mean to use the model filter (or not)? What are the implications of using it, or of not using it?


> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs found on
> pages that the classifier marks as irrelevant. From those pages it keeps
> only the URLs that contain some "hot words", which are provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)