[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590080#comment-14590080 ]
Asitang Mishra edited comment on NUTCH-2038 at 6/17/15 4:51 PM:
----------------------------------------------------------------

I have made a pull request with a rather rough patch. This initial patch is mainly meant to show the idea and get some reviews.

IDEA:
Two-tier architecture for filtering. The filter is called from the parser and looks at the page that was just parsed. It runs a Naive Bayes (NB) classification on the text of the page and decides whether the page is relevant. If the page is relevant, all of its outlinks pass. If not, a second check kicks in, which looks for "hotwords" in the outlink URLs themselves (from a wordlist provided by the user). If an outlink URL matches, it passes.

HOW TO USE:
Activate the model filter in the plugin.includes property:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(model|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> </description>
</property>

You need to set some properties in nutch-site.xml, like:

<property>
  <name>parser.modelfilter.trainfile</name>
  <value>train/tweets-train.tsv</value>
  <description> </description>
</property>
<property>
  <name>parser.modelfilter.dictionaryfile</name>
  <value>wordlist.txt</value>
  <description> </description>
</property>
<property>
  <name>parser.modelfilter</name>
  <value>true</value>
  <description> </description>
</property>

TRAINING FILE:
Keep the training file in a local folder named "train". Keep the wordlist in the conf folder.

The format of the training file is as follows:

1 21312123 I am feeling happy
1 34354646 how are you
0 35345435 can i get some coffee

The values in each line are tab-separated (\t):

[class/target: either 1 (relevant) or 0 (irrelevant)]<TAB>[Unique ID for each line, to be supplied by the user]<TAB>[TEXT]

WORDLIST:
A list of words, one per line, like:

atmosphere
java
python

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out the URLs from pages that the classifier
> (using a provided model) classifies as irrelevant. After the parsing
> stage, it keeps only those URLs that contain some "hot words", which
> are again provided in a list.
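
For reviewers who want to see the two-tier idea from the comment in code form, here is a minimal sketch. It is not the code from the pull request. The class and method names (TwoTierFilterSketch, parseTrainingLine, containsHotword, filterOutlinks) are hypothetical, the Naive Bayes result is passed in as a plain boolean rather than computed, and the case-insensitive substring match for hotwords is an assumption, since the comment does not say how matching is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the two-tier outlink check; not the code from the patch. */
final class TwoTierFilterSketch {

    /** One parsed training line: class label, user-supplied ID, text. */
    static final class TrainingRow {
        final int label;
        final String id;
        final String text;

        TrainingRow(int label, String id, String text) {
            this.label = label;
            this.id = id;
            this.text = text;
        }
    }

    /** Splits a "label<TAB>id<TAB>text" line into its three fields. */
    static TrainingRow parseTrainingLine(String line) {
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
        }
        int label = Integer.parseInt(parts[0].trim());
        if (label != 0 && label != 1) {
            throw new IllegalArgumentException("label must be 0 or 1: " + line);
        }
        return new TrainingRow(label, parts[1].trim(), parts[2]);
    }

    /** Tier 2: true if the URL contains any hotword (case-insensitive substring match, an assumption). */
    static boolean containsHotword(String url, List<String> hotwords) {
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String word : hotwords) {
            String w = word.trim().toLowerCase(Locale.ROOT);
            if (!w.isEmpty() && lowerUrl.contains(w)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Tier 1: if the page was classified relevant, every outlink passes.
     * Tier 2: otherwise only outlinks whose URL contains a hotword pass.
     */
    static List<String> filterOutlinks(boolean pageRelevant, List<String> outlinks, List<String> hotwords) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (pageRelevant || containsHotword(url, hotwords)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With the wordlist above (atmosphere, java, python), an irrelevant page that links to http://example.org/java-tutorial and http://example.org/cooking would keep only the first outlink, while a relevant page would keep both. The example.org URLs are made up for illustration.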