[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173846#comment-13173846 ]
Hudson commented on NUTCH-1184:
-------------------------------

Integrated in Nutch-trunk #1699 (See [https://builds.apache.org/job/Nutch-trunk/1699/])
    Renamed FetcherStatus to FetcherOutlinks for the new outlinks section of NUTCH-1184
    NUTCH-1184 Fetcher to parse and follow Nth degree outlinks

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221194
Files :
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221181
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

> Fetcher to parse and follow Nth degree outlinks
> -----------------------------------------------
>
>                 Key: NUTCH-1184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1184
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch,
> NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch,
> NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch,
> NUTCH-1184-1.5-9-ParseOutputFormat.patch, NUTCH-1185-1.5-6.patch,
> NUTCH-1185-1.5-7.patch, NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch
>
>
> Fetcher improvements to parse and follow outlinks up to a specified depth.
> The number of outlinks to follow can be decreased by depth using a divisor.
> This patch introduces three new configuration directives:
> {code}
> <property>
>   <name>fetcher.follow.outlinks.depth</name>
>   <value>-1</value>
>   <description>(EXPERT) When fetcher.parse is true and this value is greater
>   than 0, the fetcher will extract outlinks and follow them until the desired
>   depth is reached. A value of 1 means all generated pages are fetched and
>   their first degree outlinks are fetched and parsed too. Be careful, this
>   feature is in itself agnostic of the state of the CrawlDB and does not know
>   about already fetched pages. A setting larger than 2 will most likely fetch
>   home pages twice in the same fetch cycle.
>   It is highly recommended to set db.ignore.external.links to true to
>   restrict the outlink follower to URLs within the same domain. When disabled
>   (false) the feature is likely to follow duplicates even when depth=1.
>   A value of -1 or 0 disables this feature.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.num.links</name>
>   <value>4</value>
>   <description>(EXPERT) The number of outlinks to follow when
>   fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
>   the total number of pages to fetch. This works together with
>   fetcher.follow.outlinks.depth.divisor: with the default settings, the
>   number of outlinks followed at depth 1 is 8, not 4.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.depth.divisor</name>
>   <value>2</value>
>   <description>(EXPERT) The divisor of fetcher.follow.outlinks.num.links per
>   fetcher.follow.outlinks.depth. This decreases the number of outlinks to
>   follow as depth increases. The formula used is:
>   outlinks = floor(divisor / depth * num.links). This prevents exponential
>   growth of the fetch list.
>   </description>
> </property>
> {code}
> Please do not use this unless you know what you're doing. This feature does
> not consider the state of the CrawlDB, nor does it consider generator settings
> such as limiting the number of pages per (domain|host|ip) queue. It is not
> polite to use this feature with high settings, as it can fetch many pages from
> the same domain, including duplicates.
> Also, this feature will _not_ work if fetcher.parse is disabled. With parsing
> enabled you might want to consider not storing downloaded content.
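
To make the divisor arithmetic in the description concrete, here is a minimal sketch that evaluates the quoted formula, outlinks = floor(divisor / depth * num.links), for the default settings. The class OutlinkBudget and the method outlinksAtDepth are hypothetical names used only for illustration and are not taken from Fetcher.java. The committed code may differ in rounding or in how it indexes depth.

```java
public class OutlinkBudget {

    // Hypothetical helper (not from Fetcher.java): evaluates the formula as
    // written in the description, floor(divisor / depth * num.links).
    // Assumes depth is 1-based and uses floating-point division.
    static int outlinksAtDepth(int depth, int divisor, int numLinks) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        return (int) Math.floor((double) divisor / depth * numLinks);
    }

    public static void main(String[] args) {
        int divisor = 2;   // default of fetcher.follow.outlinks.depth.divisor
        int numLinks = 4;  // default of fetcher.follow.outlinks.num.links

        // Print the per-page outlink budget for the first few depths.
        for (int depth = 1; depth <= 4; depth++) {
            System.out.println("depth " + depth + ": "
                + outlinksAtDepth(depth, divisor, numLinks) + " outlinks");
        }
    }
}
```

Under these assumptions the budget is 8 outlinks at depth 1, which matches the "8, not 4" remark in the description, then 4 at depth 2 and 2 at depths 3 and 4.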