[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590080#comment-14590080
 ] 

Asitang Mishra commented on NUTCH-2038:
---------------------------------------

I have opened a pull request with a rather rough patch. This initial patch is 
mainly meant to show the idea and get some reviews.


IDEA:
Two-tier architecture for filtering:
The filter is called from the parser and looks at the page that was just 
parsed. It runs a Naive Bayes (NB) classification on the text of the page and 
decides whether the page is relevant. If it is relevant, all outlinks pass. If 
not, the second check kicks in: it looks for "hotwords" in the outlink URLs 
themselves (from a wordlist provided by the user). If there is a match, the 
outlink passes.
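
The two-tier decision described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the patch: the function name filter_outlinks and the two callables page_is_relevant and url_has_hotword are hypothetical stand-ins for the NB classifier and the wordlist check.

```python
from typing import Callable, Iterable, List


def filter_outlinks(
    page_text: str,
    outlinks: Iterable[str],
    page_is_relevant: Callable[[str], bool],
    url_has_hotword: Callable[[str], bool],
) -> List[str]:
    """Return the outlinks that survive the two-tier filter.

    Both callables are hypothetical placeholders: the first stands in for
    the NB classifier, the second for the user-supplied wordlist check.
    """
    links = list(outlinks)
    # Tier 1: a relevant page lets every outlink through.
    if page_is_relevant(page_text):
        return links
    # Tier 2: on an irrelevant page, keep only outlinks with a hotword.
    return [url for url in links if url_has_hotword(url)]
```

For an irrelevant page with outlinks http://example.org/java-guide and http://example.org/coffee and a check that looks for "java", only the first outlink would be kept.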



HOW TO USE:
Activate the model filter in the plugin.includes property:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(model|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>

You also need to set the following properties in nutch-site.xml:
<property>
  <name>parser.modelfilter.trainfile</name>
  <value>train/tweets-train.tsv</value>
  <description>
  </description>
</property>

<property>
  <name>parser.modelfilter.dictionaryfile</name>
  <value>wordlist.txt</value>
  <description>
  </description>
</property>

<property>
  <name>parser.modelfilter</name>
  <value>true</value>
  <description>
  </description>
</property>



TRAINING FILE:
Keep the training file in a folder named "train" in local, and keep the 
wordlist in the conf directory.
The format of the training file is as follows:

1 21312123 I am feeling happy
1 34354646 how are you
0 35345435 can i get some coffee

Each line holds tab-separated (\t) values:
<class/target: either 1 (relevant) or 0 (irrelevant)><TAB><unique ID for 
each line, supplied by the user><TAB><TEXT>
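
To make the format concrete, here is a minimal sketch in Python that parses such lines and trains a tiny multinomial Naive Bayes model on them. It is not the classifier used in the patch: the function names, the whitespace tokenization, the lowercasing, and the add-one (Laplace) smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[int, str, str]  # (label, unique id, text)
Model = Tuple[Counter, Dict[int, Counter], Set[str]]


def parse_training_lines(lines: Iterable[str]) -> List[Row]:
    """Split each line into <label><TAB><id><TAB><text>; skip blank lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        label, row_id, text = line.split("\t", 2)
        rows.append((int(label), row_id, text))
    return rows


def train(rows: Iterable[Row]) -> Model:
    """Count documents and words per class (labels must be 0 or 1)."""
    doc_counts: Counter = Counter()
    word_counts: Dict[int, Counter] = {0: Counter(), 1: Counter()}
    for label, _row_id, text in rows:
        doc_counts[label] += 1
        # Assumed tokenization: lowercase, split on whitespace.
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return doc_counts, word_counts, vocab


def classify(text: str, model: Model) -> int:
    """Return the label (0 or 1) with the higher log-probability score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        if doc_counts[label] == 0:
            continue
        score = math.log(doc_counts[label] / total_docs)
        # Add-one smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)
```

With the three example lines above as training data, "feeling happy" scores higher for class 1 and "some coffee" for class 0. A real training file would need far more lines to be useful.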



WORDLIST:

A list of words, one per line, for example:

atmosphere
java
python
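
As a rough sketch of how such a wordlist might be applied to outlink URLs, the Python below loads the words and does a case-insensitive substring match. The helper names and the substring matching are assumptions; the patch may tokenize or match URLs differently.

```python
from typing import Iterable, List


def load_hotwords(lines: Iterable[str]) -> List[str]:
    """Read one word per line, dropping blank lines and lowercasing."""
    return [line.strip().lower() for line in lines if line.strip()]


def url_has_hotword(url: str, hotwords: Iterable[str]) -> bool:
    """Assumed rule: a URL matches if any hotword occurs in it as a substring."""
    lowered = url.lower()
    return any(word in lowered for word in hotwords)
```

With the list above, a URL such as http://example.org/Java/tutorial would match on "java", while http://example.org/coffee would not.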





> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out the outlinks of pages that the classifier 
> (using a provided model) marks as irrelevant. After the parsing stage, only 
> those URLs that contain some "hot words", again provided in a list, are 
> kept.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
