[ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema closed NUTCH-1431.
-------------------------------

    Resolution: Fixed

committed

> Introduce link 'distance' and add configurable max distance in the generator
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1431
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1431
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1431.patch
>
>
> Introducing a new feature that makes it possible to crawl URLs within a
> specific distance (shortest path) from the injected source URLs. This is
> where the db-updater of Nutchgora really shines. Because every URL in the
> reducer has all of its inlinks present, it is really easy to determine what
> the shortest path to that URL is. (I would not know how to cleanly implement
> this feature for trunk.)
>
> Injected URLs have distance 0. Outlink URLs on those pages have distance 1.
> Outlinks on those pages have distance 2, etc. Outlinks that already had a
> smaller distance will keep that distance. Of all inlinks to a page, it will
> always select the smallest distance in order to maintain the shortest-path
> guarantee.
>
> Generator now has a property 'generate.max.distance' (default set to -1)
> that specifies the maximum allowed distance of URLs to select for fetch.
>
> Note that this is fundamentally different from the concept of crawl 'depth'.
> Depth is used for crawl cycles. Distance allows crawling for an unlimited
> number of cycles AND always staying within a certain number of 'hops' from
> the injected URLs.
>
> I will attach a patch. Will commit in a few days. (It does not change crawl
> behaviour unless otherwise configured.) Let me know if you have comments.
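
For readers who want to see the distance rule spelled out, here is a minimal Java sketch of the update and the generator check as described in the issue. It is not taken from NUTCH-1431.patch: the class LinkDistance, the methods update and shouldGenerate, and the UNKNOWN marker are hypothetical names chosen for illustration. Two points are assumptions rather than statements from the issue: a max distance of -1 is taken to mean "no limit" (consistent with the note that default crawl behaviour is unchanged), and a page with no recorded distance is treated as out of range once a limit is set.

```java
import java.util.List;

// Hypothetical sketch of the rule described in the issue; not Nutch code.
final class LinkDistance {

    /** Marker for a page that has no distance recorded yet (assumed convention). */
    static final int UNKNOWN = -1;

    /**
     * Returns the shortest known distance for a page.
     *
     * @param current         distance already stored for the page, or UNKNOWN
     * @param inlinkDistances distances of the pages linking to it; UNKNOWN entries are skipped
     */
    static int update(int current, List<Integer> inlinkDistances) {
        int best = current;
        for (int inlinkDistance : inlinkDistances) {
            if (inlinkDistance == UNKNOWN) {
                continue; // this inlink says nothing about the path length
            }
            int candidate = inlinkDistance + 1; // one hop further than the linking page
            if (best == UNKNOWN || candidate < best) {
                best = candidate; // keep only the smallest distance seen
            }
        }
        return best;
    }

    /**
     * Decides whether a page may be selected for fetch.
     * Assumption: maxDistance below 0 (the default -1) means no limit.
     * Assumption: a page with UNKNOWN distance is rejected when a limit is set.
     */
    static boolean shouldGenerate(int distance, int maxDistance) {
        if (maxDistance < 0) {
            return true;
        }
        return distance != UNKNOWN && distance <= maxDistance;
    }
}
```

In this sketch an injected URL keeps distance 0 because no inlink can yield a candidate below 1, and a page reached from inlinks at distances 0 and 3 ends up at distance 1. With a limit of 2, a page at distance 3 is never generated, no matter how many crawl cycles run.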