[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851710#action_12851710
]
Dmitry Lihachev commented on NUTCH-570:
---------------------------------------
Yeah, Otis. It's just an update so it applies to trunk.
> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Ned Rockson
> Assignee: Otis Gospodnetic
> Priority: Minor
> Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches
> (50-100M URLs at a time). I found that the generated URL order is not optimal
> because the URLs are simply randomized by a hash comparator. In one crawl on 24
> machines it took about 3 days to crawl 30M URLs. Compared with old benchmarks
> I had set with the regular Fetcher.java, this took at least three times as long.
> Anyway, I realized that randomization only approximates a good ordering; for
> an optimal ordering, URLs from the same host should be as far apart in the
> list as possible. So I wrote a series of two map/reduce jobs to optimize the
> ordering; for a list of 25M documents they take about 10 minutes on our
> cluster. Right now I have it in its own class, but I figure it can go into
> Generator.java, with a flag in nutch-default.xml determining whether the
> user wants to use it.
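
To illustrate the ordering goal described above (URLs from the same host kept apart in the fetch list), here is a minimal in-memory sketch. It is not the attached patch and not the two map/reduce jobs mentioned in the description. The class name HostSpread and the method interleaveByHost are hypothetical, and the round-robin strategy is an assumption chosen for brevity rather than the approach taken in GeneratorDiff.out.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or of the attached patch.
public class HostSpread {

    // Groups URLs by host, then emits one URL per host in round-robin order
    // so that consecutive entries come from different hosts where possible.
    public static List<String> interleaveByHost(List<String> urls) {
        // LinkedHashMap keeps hosts in first-seen order, so output is deterministic.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a parsable host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            // One pass takes at most one URL from each host that still has any.
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1", "http://c.example/1");
        for (String url : interleaveByHost(urls)) {
            System.out.println(url);
        }
    }
}
```

Round-robin is only a rough approximation of "as far apart as possible": once the smaller hosts are exhausted, the remaining URLs of a large host end up adjacent at the tail of the list (a.example/2 and a.example/3 in the example above). A production version would also have to run at cluster scale, which is presumably why the description uses map/reduce jobs.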