Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

------------------------------------------------------------------------------
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
  
+ }}}
+ 
+ One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs.  By default it ignores outlinks to pages in the same domain, including subdomains, and to pages with the same hostname.  It also keeps at most one outlink from each page to the same target page or the same target domain.  All of these behaviors can be changed through the following configuration properties:
+ 
+ {{{
+ 
+ <!-- linkrank scoring properties -->
+ <property>
+   <name>link.ignore.internal.host</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same hostname.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.internal.domain</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same domain.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.limit.page</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same page.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.limit.domain</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same domain.</description>
+ </property> 
+ 
+ }}}
+ 
+ With these defaults, however, if you are only crawling pages within a single domain or within a set of subdomains, all outlinks will be ignored and you will end up with an empty webgraph.  This in turn causes an error in the LinkRank job.  The flip side is that if you do NOT ignore links to the same domain/host and do not limit those links, the webgraph becomes much denser, so there are many more links to process, and those extra links probably won't affect relevancy as much.
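+ 
+ As a rough sketch of the single-domain case, the overrides below turn off the two "ignore" properties so that internal links are kept.  It assumes the overrides go in conf/nutch-site.xml, and leaving the two "limit" properties at their defaults is only a suggestion to keep the graph from growing too dense; adjust to your crawl.
+ 
+ {{{
+ 
+ <!-- assumed overrides for a single-domain crawl; property names are from the list above -->
+ <property>
+   <name>link.ignore.internal.host</name>
+   <value>false</value>
+   <description>Keep outlinks to the same hostname.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.internal.domain</name>
+   <value>false</value>
+   <description>Keep outlinks to the same domain.</description>
+ </property>
+ 
+ }}}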
+ 
+ {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
