[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854665#action_12854665 ]
Otis Gospodnetic commented on NUTCH-570:
----------------------------------------

I'm tempted to close this issue as Won't Fix, because:
* I have no way to test and verify this
* nobody seems to be using this
* this issue has only 2 votes and only 3 watchers
* the original reporter mentioned he noticed only marginal speedups

> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
>                 Key: NUTCH-570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-570
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches
> (50-100M at a time). I found that the URLs generated are not optimal because
> they are simply randomized by a hash comparator. In one crawl on 24 machines
> it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I
> had set with regular Fetcher.java this was at least 3 fold more time.
> Anyway, I realized that the best situation for ordering can be approached by
> randomization, but in order to get optimal ordering, urls from the same host
> should be as far apart in the list as possible. So I wrote a series of 2
> map/reduces to optimize the ordering and for a list of 25M documents it takes
> about 10 minutes on our cluster. Right now I have it in its own class, but I
> figured it can go in Generator.java and just add a flag in nutch-default.xml
> determining if the user wants to use it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
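
For readers unfamiliar with the ordering idea in the quoted description (URLs from the same host placed as far apart in the list as possible), here is a minimal single-machine sketch. It is not the reporter's two-job map/reduce implementation, which is only in the attached diffs and is not reproduced here. The class name HostSpreadOrdering, the method spread, and the fractional-position scheme are all assumptions made for illustration: the k-th of a host's n URLs gets position (k + 0.5) / n, and the combined list is sorted by that position.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch and not the patch attached to NUTCH-570.
final class HostSpreadOrdering {

    // One URL together with its target position in [0, 1) and its host.
    private static final class Slot {
        final double position;
        final String host;
        final String url;

        Slot(double position, String host, String url) {
            this.position = position;
            this.host = host;
            this.url = url;
        }
    }

    // Returns the URLs reordered so that each host's URLs are spread evenly over the list.
    // URI.create throws IllegalArgumentException on malformed input; this sketch does not handle that.
    static List<String> spread(List<String> urls) {
        // Group URLs by host, preserving first-seen order within each host.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }

        // Give the k-th of n URLs for a host the fractional position (k + 0.5) / n.
        List<Slot> slots = new ArrayList<>(urls.size());
        for (Map.Entry<String, List<String>> entry : byHost.entrySet()) {
            List<String> hostUrls = entry.getValue();
            int n = hostUrls.size();
            for (int k = 0; k < n; k++) {
                slots.add(new Slot((k + 0.5) / n, entry.getKey(), hostUrls.get(k)));
            }
        }

        // Sort by position; break ties by host name so the result is deterministic.
        slots.sort(Comparator.comparingDouble((Slot s) -> s.position)
                .thenComparing(s -> s.host));

        List<String> ordered = new ArrayList<>(slots.size());
        for (Slot slot : slots) {
            ordered.add(slot.url);
        }
        return ordered;
    }
}
```

Calling HostSpreadOrdering.spread on two URLs from a.com followed by two from b.com interleaves them (a, b, a, b). When one host dominates the list, adjacent same-host URLs cannot be avoided, and the sketch only spaces them as evenly as the list length allows.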