[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764293#action_12764293 ]
Hudson commented on NUTCH-707:
------------------------------

Integrated in Nutch-trunk #959 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
    Generation of multiple segments in multiple runs returns only 1 segment.

> Generation of multiple segments in multiple runs returns only 1 segment
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-707
>                 URL: https://issues.apache.org/jira/browse/NUTCH-707
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0
>         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
>            Reporter: Michael Chan
>            Assignee: Andrzej Bialecki
>             Fix For: 1.1
>
>         Attachments: GeneratorDiff
>
>
> To generate multiple segments, generator.update.crawldb is set to true and
> -topN is set to the desired segment size. However, only one segment of
> size N is generated.
> For example, I tried it with a db containing 10,000+ links according to
> the dump. When generator.update.crawldb is set to true and -topN is set to 5,
> only 1 segment of size 5 is produced.
> The problem appears to be that the generation time is recorded incorrectly.
> Selector.map assigns the generation time to every URL, even though reduce
> only collects N of them. That is harmless if the generator is run once and
> the db isn't updated. But when the db is updated and the generator is run
> again within genDelay, all the remaining URLs are excluded. So I suggest
> that the generation time be assigned in reduce rather than in map.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
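
The behaviour described in the report can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Nutch's Generator code: the class GenerateTimeSketch, the method run, the GEN_DELAY constant, and the example.com URLs are all hypothetical, and the "map" and "reduce" steps are plain loops standing in for the real MapReduce phases. It contrasts stamping the generation time on every URL in the map step with stamping only the URLs actually selected in the reduce step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenerateTimeSketch {
    // Hypothetical stand-in for genDelay: URLs stamped more recently than
    // this are skipped by the next generate run.
    static final long GEN_DELAY = 1000L;

    /**
     * Runs a toy generator `runs` times over a db of `dbSize` URLs and
     * returns the size of the segment produced by each run.
     * markInMap = true  : stamp every candidate in the "map" step (reported bug)
     * markInMap = false : stamp only the selected URLs in the "reduce" step
     */
    static List<Integer> run(boolean markInMap, int dbSize, int topN, int runs) {
        // URL -> generation time; 0 means "never generated".
        Map<String, Long> genTime = new LinkedHashMap<>();
        for (int i = 0; i < dbSize; i++) {
            genTime.put("http://example.com/" + i, 0L);
        }

        List<Integer> segmentSizes = new ArrayList<>();
        long now = 10_000L;
        for (int r = 0; r < runs; r++, now++) {
            // "map": skip URLs generated within GEN_DELAY, pass on the rest.
            List<String> candidates = new ArrayList<>();
            for (Map.Entry<String, Long> e : genTime.entrySet()) {
                long last = e.getValue();
                if (last != 0L && now - last < GEN_DELAY) {
                    continue;
                }
                candidates.add(e.getKey());
                if (markInMap) {
                    e.setValue(now); // stamps URLs that may never be selected
                }
            }

            // "reduce": keep at most topN candidates for this segment.
            List<String> segment =
                candidates.subList(0, Math.min(topN, candidates.size()));
            if (!markInMap) {
                for (String url : segment) {
                    genTime.put(url, now); // stamp only what was selected
                }
            }
            segmentSizes.add(segment.size());
        }
        return segmentSizes;
    }

    public static void main(String[] args) {
        System.out.println(run(true, 12, 5, 2));  // prints [5, 0]
        System.out.println(run(false, 12, 5, 2)); // prints [5, 5]
    }
}
```

With the time stamped in the map step, the second run inside the delay window finds no eligible URLs, which matches the single segment reported above. Stamping in the reduce step leaves the unselected URLs available for the next run.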