[ 
https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764293#action_12764293
 ] 

Hudson commented on NUTCH-707:
------------------------------

Integrated in Nutch-trunk #959 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
     Generation of multiple segments in multiple runs returns only 1 segment.


> Generation of multiple segments in multiple runs returns only 1 segment
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-707
>                 URL: https://issues.apache.org/jira/browse/NUTCH-707
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0
>         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
>            Reporter: Michael Chan
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.1
>
>         Attachments: GeneratorDiff
>
>
> To generate multiple segments, generator.update.crawldb is set to true and 
> -topN is set to the desired segment size. However, only one segment of 
> size N is generated.
> For example, I tried it with a db containing 10,000+ links according to 
> the dump. With generator.update.crawldb set to true and -topN set to 5, 
> only 1 segment of size 5 is produced.
> It seems to me the problem is due to incorrect recording of the generation 
> time. Selector.map assigns the generation time to every URL, even though 
> reduce only collects N of them. That is fine if the generator is run only 
> once and the db isn't updated. But if the generator is run again within 
> genDelay, all the remaining URLs are excluded. So I suggest that the 
> generation time be assigned in reduce rather than in map.
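
The behaviour described above can be reproduced with a small standalone sketch. This is not the Nutch Generator code and does not use Hadoop; the class GeneratorSketch, its methods, and the in-memory map standing in for the crawl db are all hypothetical, made up only to illustrate the difference between stamping the generation time in the map step and stamping it in the reduce step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the reported problem; not Nutch's actual Generator.
class GeneratorSketch {

    // Stand-in for the crawl db: URL -> last generation time (0 = never generated).
    static Map<String, Long> newDb(int size) {
        Map<String, Long> db = new LinkedHashMap<>();
        for (int i = 0; i < size; i++) {
            db.put("http://example.com/" + i, 0L);
        }
        return db;
    }

    // One generator run. If markInMap is true, every candidate URL is stamped
    // with the generation time in the "map" step (the reported behaviour).
    // Otherwise only the URLs actually selected in the "reduce" step are stamped
    // (the suggested behaviour).
    static List<String> generate(Map<String, Long> db, int topN, long now,
                                 long genDelay, boolean markInMap) {
        List<String> candidates = new ArrayList<>();

        // "map" step: skip URLs generated less than genDelay ago.
        for (Map.Entry<String, Long> e : db.entrySet()) {
            long last = e.getValue();
            if (last != 0 && now - last < genDelay) {
                continue;
            }
            candidates.add(e.getKey());
            if (markInMap) {
                e.setValue(now); // stamps every candidate, selected or not
            }
        }

        // "reduce" step: keep at most topN URLs.
        int n = Math.min(topN, candidates.size());
        List<String> segment = new ArrayList<>(candidates.subList(0, n));
        if (!markInMap) {
            for (String url : segment) {
                db.put(url, now); // stamps only the selected URLs
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        long genDelay = 1000L;

        Map<String, Long> buggy = newDb(20);
        int first = generate(buggy, 5, 1000L, genDelay, true).size();
        int second = generate(buggy, 5, 1001L, genDelay, true).size();
        System.out.println("stamp in map:    " + first + " then " + second);

        Map<String, Long> fixed = newDb(20);
        first = generate(fixed, 5, 1000L, genDelay, false).size();
        second = generate(fixed, 5, 1001L, genDelay, false).size();
        System.out.println("stamp in reduce: " + first + " then " + second);
    }
}
```

In this model, with 20 URLs and topN = 5, the second run within genDelay yields an empty segment when the time is stamped in the map step, and a second segment of 5 different URLs when it is stamped in the reduce step.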

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
