[ https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-396.
-------------------------------

    Resolution: Won't Fix

> mergesegs sorts URLs, making segments useless for subsequent fetch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-396
>                 URL: https://issues.apache.org/jira/browse/NUTCH-396
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8
>         Environment: Mac OS X 10.4.7
>            Reporter: Doug Cook
>            Priority: Minor
>
> Mergesegs leaves the output segment in URL-sorted order.
> This is a problem if the segment was just generated and not yet fetched - the
> fetcher likes the URLs to be in essentially random order (sort by URL hash or
> similar). If I fetch a segment created by mergesegs, my performance is
> extremely poor since all URLs from a given host will be grouped together and
> the per-host delays kill me.
> I have a local fix which I am using: map using a key of MD5(URL) + URL, then,
> during the reduce phase, chop the MD5 off the front to get the original URL.
> This is simple, has essentially random order, no problems with collisions,
> and seems to work nicely.
> The only thing I don't know is whether or not there is some other tool
> expecting the sorted order (I would expect not, since generate does not
> produce this). Right now I have my fix as an option (-randomize), but if
> there is no other tool requiring sorted order, it's probably cleaner to just
> make this non-optional.
> Thoughts?
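
The local fix described in the issue (key each record on MD5(URL) + URL, sort, then chop the MD5 off the front) can be sketched roughly as below. This is only an illustration in plain Java, not the reporter's patch and not Nutch or Hadoop code: the class and method names (RandomizedUrlOrder, toSortKey, fromSortKey, randomize) are hypothetical, the MD5 is assumed to be encoded as 32 hex characters, and an in-memory sort stands in for the shuffle between the map and reduce phases.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; none of these names come from Nutch.
public class RandomizedUrlOrder {
    // Assumption: the MD5 is hex-encoded, so the prefix is always 32 characters
    // and can be removed again without any delimiter.
    static final int PREFIX_LEN = 32;

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(PREFIX_LEN);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // "Map" side: prefix the URL with its hash. Keeping the full URL in the key
    // means two URLs with the same hash still produce distinct keys.
    static String toSortKey(String url) {
        return md5Hex(url) + url;
    }

    // "Reduce" side: drop the fixed-width hash to recover the original URL.
    static String fromSortKey(String key) {
        return key.substring(PREFIX_LEN);
    }

    // Stand-in for map, shuffle/sort, and reduce over a small in-memory list.
    static List<String> randomize(List<String> urls) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            keys.add(toSortKey(url));
        }
        Collections.sort(keys); // sorted by hash, so hosts are interleaved
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(fromSortKey(key));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1", "http://b.example/2", "http://b.example/3");
        for (String url : randomize(urls)) {
            System.out.println(url);
        }
    }
}
```

The resulting order is deterministic for a given set of URLs but unrelated to host name, which is the property the fetcher is said to want. Because the key still contains the whole URL, hash collisions only affect relative ordering and never merge two records.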