[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

Markus Jelsma (Updated) (JIRA) Wed, 16 Nov 2011 04:01:20 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Markus Jelsma updated NUTCH-1184:
---------------------------------

    Attachment: NUTCH-1185-1.5-9.patch
    
> Fetcher to parse and follow Nth degree outlinks
> -----------------------------------------------
>
>                 Key: NUTCH-1184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1184
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, 
> NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch, 
> NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch, 
> NUTCH-1185-1.5-6.patch, NUTCH-1185-1.5-7.patch, NUTCH-1185-1.5-8.patch, 
> NUTCH-1185-1.5-9.patch
>
>
> Fetcher improvements to parse and follow outlinks up to a specified depth. 
> The number of outlinks to follow can be decreased by depth using a divisor. 
> This patch introduces three new configuration directives:
> {code}
> <property>
>   <name>fetcher.follow.outlinks.depth</name>
>   <value>-1</value>
>   <description>(EXPERT) When fetcher.parse is true and this value is greater 
> than 0, the fetcher will extract outlinks
>   and follow them until the desired depth is reached. A value of 1 means all 
> generated pages are fetched and their first degree
>   outlinks are fetched and parsed too. Be careful, this feature is in itself 
> agnostic of the state of the CrawlDB and does not
>   know about already fetched pages. A setting larger than 2 will most likely 
> fetch home pages twice in the same fetch cycle.
>   It is highly recommended to set db.ignore.external.links to true to 
> restrict the outlink follower to URLs within the same
>   domain. When disabled (false) the feature is likely to follow duplicates 
> even when depth=1.
>   A value of -1 or 0 disables this feature.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.num.links</name>
>   <value>4</value>
>   <description>(EXPERT) The number of outlinks to follow when 
> fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
>   the total number of pages to fetch. This works together with 
> fetcher.follow.outlinks.depth.divisor: with the default settings, 8 outlinks
>   are followed at depth 1, not 4.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.depth.divisor</name>
>   <value>2</value>
>   <description>(EXPERT) The divisor of fetcher.follow.outlinks.num.links per 
> fetcher.follow.outlinks.depth. This decreases the number
>   of outlinks to follow as depth increases. The formula used is: outlinks = 
> floor(divisor / depth * num.links). This prevents
>   exponential growth of the fetch list.
>   </description>
> </property>
> {code}
> Please, do not use this unless you know what you're doing. This feature does 
> not consider the state of the CrawlDB nor does it consider generator settings 
> such as limiting the number of pages per (domain|host|ip) queue. It is not 
> polite to use this feature with high settings as it can fetch many pages from 
> the same domain including duplicates.
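
To make the effect of the divisor concrete, here is a minimal sketch of the formula quoted above, outlinks = floor(divisor / depth * num.links). It is not taken from the patch: the function names outlinks_at_depth and worst_case_pages are hypothetical, the integer form (divisor * num.links) // depth is an assumption about how the floor is applied, and the page count is only a worst-case bound that assumes every fetched page has enough outlinks and that nothing is deduplicated.

```python
def outlinks_at_depth(depth, num_links=4, divisor=2):
    """Outlinks followed per page at a given depth (1-based), per the quoted formula.

    Hypothetical helper; defaults mirror the values in the quoted properties.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    # Integer form of floor(divisor / depth * num_links); avoids float rounding.
    return (divisor * num_links) // depth


def worst_case_pages(max_depth, num_links=4, divisor=2):
    """Upper bound on pages fetched per generated page, the page itself included.

    Assumes no deduplication and enough outlinks on every page (an assumption,
    the description only says duplicates are likely).
    """
    if max_depth <= 0:  # -1 or 0 disables the feature
        return 1
    total, frontier = 1, 1
    for depth in range(1, max_depth + 1):
        frontier *= outlinks_at_depth(depth, num_links, divisor)
        total += frontier
    return total


for d in range(1, 5):
    print(d, outlinks_at_depth(d), worst_case_pages(d))
```

With the default values this gives 8, 4, 2 and 2 outlinks per page at depths 1 to 4, and a bound of 41 pages per generated page when fetcher.follow.outlinks.depth is 2. The per-page count shrinks with depth, but the total still multiplies, which is why high settings are impolite towards a single domain.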

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
