[ 
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445999 ] 
            
Sami Siren commented on NUTCH-395:
----------------------------------

> settings. I.e. if someone created a segment with high max # of outlinks, you
> should still be able to read it and process all outlinks. If you enforce the
> max # during reading you won't be able to process this data.

Yes, I agree, but IMO we should also not store more than the configured max # of
links. Right now it seems we store them all (or am I just not seeing it?).
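
To make the write-side vs. read-side distinction concrete, here is a minimal sketch in Java. It is not code from the patch or from Nutch: `OutlinkCap`, `capForWrite` and `readAll` are hypothetical names, outlinks are modelled as plain strings, and treating a negative limit as "no limit" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: caps outlinks before they are stored. */
final class OutlinkCap {
    private OutlinkCap() {}

    /**
     * Returns at most maxOutlinks entries from the parsed outlinks.
     * A negative limit is treated as "no limit" (an assumption of this sketch).
     */
    static List<String> capForWrite(List<String> outlinks, int maxOutlinks) {
        if (maxOutlinks < 0 || outlinks.size() <= maxOutlinks) {
            return new ArrayList<>(outlinks);
        }
        // Keep only the first maxOutlinks links, in their original order.
        return new ArrayList<>(outlinks.subList(0, maxOutlinks));
    }

    /** Reading applies no limit, so a segment written with a higher max stays fully usable. */
    static List<String> readAll(List<String> stored) {
        return new ArrayList<>(stored);
    }
}
```

With the cap applied only when writing, a segment created under a larger limit can still be read in full, which is the behaviour asked for in the quoted comment.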

> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: nutch-0.8-performance.txt
>
>
> There has been some discussion on the Nutch mailing lists about the fetcher
> being slow; this patch tries to address that. The patch is just a quick hack
> and needs some cleaning up. It currently applies to the 0.8 branch and not
> trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original Metadata uses spellchecking, the new version does not
> (a decorator is provided that can do it, and it should perhaps be used where
> HTTP headers are handled, but in most cases the functionality is not
> required).
> Reading/writing various data structures - the patch tries to do I/O more
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of the changes with a
> script that basically does the following:
> - inject a list of URLs into a fresh crawldb
> - create a fetchlist (10k URLs pointing to the local filesystem)
> - fetch
> - updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch:
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s
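
The Metadata item in the description above (plain lookups by default, with spellchecking available through a decorator) could be sketched roughly as follows. This is not the code from the patch: `MetaStore`, `PlainMetaStore` and `SpellCheckingMetaStore` are hypothetical names, and the "spellchecking" here is a simplified stand-in that only ignores case and punctuation in a few well-known header names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical minimal metadata interface for this sketch. */
interface MetaStore {
    void set(String name, String value);
    String get(String name);
}

/** Plain store: exact-match names, no normalization work on each call. */
final class PlainMetaStore implements MetaStore {
    private final Map<String, String> values = new HashMap<>();
    public void set(String name, String value) { values.put(name, value); }
    public String get(String name) { return values.get(name); }
}

/** Decorator that maps sloppy header names to canonical ones before delegating. */
final class SpellCheckingMetaStore implements MetaStore {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        // Small illustrative set of canonical names (an assumption of this sketch).
        for (String n : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(key(n), n);
        }
    }

    private final MetaStore delegate;

    SpellCheckingMetaStore(MetaStore delegate) { this.delegate = delegate; }

    /** Lower-cases the name and drops everything except letters and digits. */
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Returns the canonical spelling if one is known, otherwise the name unchanged. */
    private static String normalize(String name) {
        String canonical = CANONICAL.get(key(name));
        return canonical != null ? canonical : name;
    }

    public void set(String name, String value) { delegate.set(normalize(name), value); }
    public String get(String name) { return delegate.get(normalize(name)); }
}
```

Only code paths that handle HTTP headers would wrap the store in the decorator; everything else uses the plain store and skips the per-call normalization cost.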

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
