[jira] [Commented] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

Ferdy Galema (JIRA) Wed, 18 Jul 2012 04:40:40 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417005#comment-13417005
 ]


Ferdy Galema commented on NUTCH-1431:
-------------------------------------

It is a way to keep the size of a crawl within limits, without making 
concessions to the various existing properties. We have a scenario where we 
want to crawl specific sites in a breadth-first manner. Previously we had to 
set topN to unlimited and make sure fetch cycles finish their entire batch. 
The problem is that the number of urls tends to grow very large after just a few 
cycles, so the fetcher needs an unreasonably long time to run.

With this new option you can have small (in terms of topN/time limits), 
maintainable fetch iterations but still be able to crawl within a specified number 
of hops. And this number is adjustable for every cycle: after a number of 
iterations, when all urls within the limit have been fetched (the output of the 
generator will be very small to empty), simply increase the limit by 1 to fetch 
a new set of urls that is one hop further.
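
To illustrate the workflow described above, here is a minimal Python simulation. It is a sketch, not Nutch code: the toy link graph, the function names `generate` and `crawl`, and the rule "raise the limit as soon as the generator returns nothing" are all assumptions made for the example.

```python
def generate(db, fetched, max_distance, top_n):
    """Pick up to top_n unfetched urls whose distance is within the limit."""
    candidates = sorted(
        url for url, dist in db.items()
        if url not in fetched and dist <= max_distance
    )
    return candidates[:top_n]


def crawl(graph, seeds, top_n, final_distance):
    """Run small fetch cycles, raising the distance limit by 1 whenever
    the generator has nothing left to select."""
    db = {url: 0 for url in seeds}  # url -> shortest known distance
    fetched = set()
    history = []                    # (max_distance, batch) for every cycle
    max_distance = 0
    while max_distance <= final_distance:
        batch = generate(db, fetched, max_distance, top_n)
        if not batch:
            max_distance += 1       # everything within the limit is fetched
            continue
        history.append((max_distance, batch))
        for url in batch:
            fetched.add(url)
            for out in graph.get(url, []):
                # an outlink keeps a smaller distance if it already has one
                db[out] = min(db.get(out, db[url] + 1), db[url] + 1)
    return fetched, history


# Hypothetical link graph: "a" is the only injected url.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"],
         "d": ["f"], "e": [], "f": ["g"]}
fetched, history = crawl(graph, seeds=["a"], top_n=1, final_distance=2)
```

Every cycle in `history` stays at or below `top_n` urls, yet the crawl still covers everything within two hops of the seed and never touches "f" or "g".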
                
> Introduce link 'distance' and add configurable max distance in the generator
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1431
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1431
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1431.patch
>
>
> Introducing a new feature that enables crawling URLs within a specific 
> distance (shortest path) from the injected source urls. This is where the 
> db-updater of Nutchgora really shines. Because every url in the reducer has 
> all of its inlinks present, it is really easy to determine what the shortest 
> path is to that url. (I would not know how to cleanly implement this feature 
> for trunk).
> Injected urls have distance 0. Outlink urls on those pages have distance 1. 
> Outlinks on those pages have distance 2, etc. Outlinks that already had a 
> smaller distance will keep that distance. Of all inlinks to a page, it will 
> always select the smallest distance in order to maintain the shortest path 
> guarantee.
> Generator now has a property 'generate.max.distance' (default set to -1) that 
> specifies the maximum allowed distance of urls to select for fetch.
> Note that this is fundamentally different from the concept of crawl 'depth'. 
> Depth is used for crawl cycles. Distance allows crawling for an unlimited number 
> of cycles AND always staying within a certain number of 'hops' from injected 
> urls.
> I will attach a patch. Will commit in a few days. (It does not change crawl 
> behaviour unless otherwise configured). Let me know if you have comments.
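
The distance rule and the generator check from the quoted description can be sketched in a few lines of Python. This is not the code from NUTCH-1431.patch: the function names are hypothetical, and two points are assumptions rather than statements from the issue, namely that -1 means "no limit" and that the comparison against the maximum is inclusive.

```python
from typing import Iterable, Optional

UNLIMITED = -1  # assumed meaning of the default value of 'generate.max.distance'


def update_distance(current: Optional[int],
                    inlink_distances: Iterable[int],
                    injected: bool = False) -> Optional[int]:
    """Return the shortest known distance of a url.

    Injected urls have distance 0. Otherwise the distance is the smallest
    inlink distance plus one, unless the url already had a smaller distance.
    None means no distance is known yet.
    """
    if injected:
        return 0
    candidates = [d + 1 for d in inlink_distances]
    if current is not None:
        candidates.append(current)
    return min(candidates) if candidates else None


def within_max_distance(distance: Optional[int],
                        max_distance: int = UNLIMITED) -> bool:
    """Generator-side check: may this url be selected for fetch?"""
    if max_distance == UNLIMITED:
        return True  # assumption: -1 leaves crawl behaviour unchanged
    return distance is not None and distance <= max_distance
```

For example, a page linked from pages at distance 1 and 3 gets distance 2, and a page that already has distance 1 keeps it even when a new inlink at distance 4 appears.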

