Ferdy Galema created NUTCH-1431:
-----------------------------------

             Summary: Introduce link 'distance' and add configurable max distance in the generator
                 Key: NUTCH-1431
                 URL: https://issues.apache.org/jira/browse/NUTCH-1431
             Project: Nutch
          Issue Type: New Feature
            Reporter: Ferdy Galema
             Fix For: 2.1


This introduces a new feature that makes it possible to crawl only URLs within 
a specific distance (shortest path) from the injected source URLs. This is 
where the db-updater of Nutchgora really shines. Because every URL in the 
reducer has all of its inlinks present, it is really easy to determine the 
shortest path to that URL. (I would not know how to cleanly implement this 
feature for trunk.)

Injected URLs have distance 0. Outlink URLs on those pages have distance 1. 
Outlinks on those pages have distance 2, etc. Outlinks that already had a 
smaller distance will keep that distance. Of all inlinks to a page, the one 
with the smallest distance is always selected in order to maintain the 
shortest path guarantee.
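
To make the rule above concrete, here is a minimal sketch of how the distance 
of a single page could be resolved from its inlinks. It is not the code from 
the patch: the class name DistanceSketch, the method resolve and the UNKNOWN 
marker are hypothetical names chosen for illustration, and the distances are 
passed in as plain integers rather than read from the web table.

```java
import java.util.List;

// Illustrative sketch only; names and types are assumptions, not the patch.
final class DistanceSketch {

    // Hypothetical marker for "no distance recorded yet".
    static final int UNKNOWN = -1;

    /**
     * Resolves the distance of one page.
     *
     * @param current         distance already stored for the page, or UNKNOWN
     *                        (an injected URL would already carry 0 here)
     * @param inlinkDistances distances of the pages that link to this page
     * @return the smallest of the stored distance and (inlink distance + 1)
     */
    static int resolve(int current, List<Integer> inlinkDistances) {
        int best = current;
        for (int inlinkDistance : inlinkDistances) {
            if (inlinkDistance == UNKNOWN) {
                continue; // this inlink has no known path from an injected URL
            }
            int candidate = inlinkDistance + 1; // one extra hop via this inlink
            if (best == UNKNOWN || candidate < best) {
                best = candidate; // keep the shortest path seen so far
            }
        }
        return best;
    }
}
```

For example, a page with no stored distance and inlinks at distances 0 and 3 
ends up at distance 1, while an injected URL (stored distance 0) stays at 0 
no matter which pages link to it.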

Generator now has a property 'generate.max.distance' (default set to -1) that 
specifies the maximum allowed distance of urls to select for fetch.
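
As a rough illustration of the check this property implies, the sketch below 
decides whether a URL at a given distance is eligible for selection. It 
assumes that a negative value such as the default -1 means "no limit"; that 
reading is an assumption based on the default leaving crawl behaviour 
unchanged. The class and method names are hypothetical and do not come from 
the patch.

```java
// Illustrative sketch only; not the generator code from the patch.
final class MaxDistanceFilterSketch {

    /**
     * @param distance    shortest-path distance of the URL from an injected URL
     * @param maxDistance value of 'generate.max.distance'
     * @return true if the URL may be selected for fetch
     */
    static boolean shouldGenerate(int distance, int maxDistance) {
        if (maxDistance < 0) {
            return true; // assumption: a negative limit disables the check
        }
        return distance <= maxDistance; // within the allowed number of hops
    }
}
```

With a limit of 2, URLs at distance 0, 1 or 2 pass the check and a URL at 
distance 3 is skipped; with the default of -1 every URL passes.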

Note that this is fundamentally different from the concept of crawl 'depth'. 
Depth is used for crawl cycles. Distance allows crawling for an unlimited 
number of cycles AND always staying within a certain number of 'hops' from the 
injected URLs.

I will attach a patch. Will commit in a few days. (It does not change crawl 
behaviour unless otherwise configured). Let me know if you have comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
