[ 
https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-396.
-------------------------------

    Resolution: Won't Fix

> mergesegs sorts URLs, making segments useless for subsequent fetch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-396
>                 URL: https://issues.apache.org/jira/browse/NUTCH-396
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8
>         Environment: Mac OS X 10.4.7
>            Reporter: Doug Cook
>            Priority: Minor
>
> Mergesegs leaves the output segment in URL-sorted order.
> This is a problem if the segment was just generated and not yet fetched: the 
> fetcher expects the URLs to be in essentially random order (sorted by URL 
> hash or similar). If I fetch a segment created by mergesegs, fetch 
> performance is extremely poor, since all URLs from a given host are grouped 
> together and the per-host delays dominate the run time.
> I have a local fix which I am using: map using a key of MD5(URL) + URL, then, 
> during the reduce phase, chop the MD5 off the front to recover the original 
> URL. This is simple, yields an essentially random order, has no problems 
> with collisions (the full URL is still part of the key), and seems to work 
> nicely.
> The only thing I don't know is whether or not there is some other tool 
> expecting the sorted order (I would expect not, since generate does not 
> produce this). Right now I have my fix as an option (-randomize), but if 
> there is no other tool requiring sorted order, it's probably cleaner to just 
> make this non-optional.
> Thoughts?
>  
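
For illustration only, the key scheme described in the issue can be sketched 
outside of Nutch as follows. This is not the reporter's patch and does not use 
any Nutch or Hadoop API; the function names (randomizing_key, original_url, 
merge_order) are hypothetical, and a plain sorted() call stands in for the 
sort that happens between the map and reduce phases.

```python
import hashlib

# Length of an MD5 digest written as hexadecimal text.
MD5_HEX_LEN = 32


def randomizing_key(url: str) -> str:
    """Map-side key: MD5(URL) as hex, followed by the URL itself.

    Keeping the full URL in the key means two different URLs can never
    produce the same key, even if their MD5 digests were to collide.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + url


def original_url(key: str) -> str:
    """Reduce-side step: chop the fixed-length MD5 prefix off the key."""
    return key[MD5_HEX_LEN:]


def merge_order(urls):
    """Return the URLs in the order a sort on the prefixed keys would give.

    sorted() is a stand-in (an assumption of this sketch) for the
    framework's sort between map and reduce.
    """
    keys = sorted(randomizing_key(u) for u in urls)
    return [original_url(k) for k in keys]


if __name__ == "__main__":
    sample = [
        "http://a.example.com/1",
        "http://a.example.com/2",
        "http://a.example.com/3",
        "http://b.example.org/1",
        "http://b.example.org/2",
    ]
    for url in merge_order(sample):
        print(url)
```

Because the sort is dominated by the digest prefix, URLs that share a host are 
no longer ordered by their common prefix, while the output still contains 
exactly the input URLs.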

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
