[ 
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445532 ] 
            
Andrzej Bialecki  commented on NUTCH-395:
-----------------------------------------

I have several comments on this patch:

* Have you measured which had the biggest impact on performance: the changes 
to Metadata, or the changes to IO in FetcherOutput?

* I think it's a good idea to separate the two concerns into PlainMetadata / 
MetadataSpellChecker. Since the latter is a subclass, I think it would be more 
appropriate to name it SpellCheckedMetadata.

* I'd also argue for keeping the name Metadata and just replacing the body of 
the class with the PlainMetadata implementation - this way we could avoid 
changing the API in so many places; for compatibility we could just bump the 
version number in Metadata. We could then also avoid changes to the version ids 
of other classes that rely on Metadata, such as Content, ParseData et al.

* The new Metadata / SpellCheckedMetadata classes need JUnit tests - this is 
important, because many other classes rely on these classes working properly.

* Fetcher.VoidReducer is not needed - I'm guessing you wanted to use it just 
for logging.

* please observe formatting rules, especially whitespace rules - this patch 
doesn't follow them.
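
To illustrate the naming and subclassing suggestion above, here is a minimal sketch. It is not the code from the patch and not Nutch's actual Metadata API: the class names PlainMetadataSketch and SpellCheckedMetadataSketch, the list of canonical header names, and the normalization rule (lowercase, strip non-alphanumerics, exact table lookup) are all assumptions made for illustration. The point is only that the plain class does no name correction, while the subclass corrects names and then delegates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a plain Metadata: names are stored exactly as given. */
class PlainMetadataSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Adds a value under the given name; one name may hold several values. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value stored under the name, or null if there is none. */
    public String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns all values stored under the name (empty list if none). */
    public List<String> getValues(String name) {
        List<String> list = values.get(name);
        if (list == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(list);
    }
}

/** Hypothetical subclass: maps sloppy names to canonical ones, then delegates. */
class SpellCheckedMetadataSketch extends PlainMetadataSketch {
    // Assumed set of known names; a real implementation would have its own list.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String name : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(simplify(name), name);
        }
    }

    /** Lowercases the name and drops everything except ASCII letters and digits. */
    private static String simplify(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Returns the canonical form of a known name; unknown names pass through unchanged. */
    static String normalize(String name) {
        String canonical = CANONICAL.get(simplify(name));
        return canonical != null ? canonical : name;
    }

    @Override
    public void add(String name, String value) {
        super.add(normalize(name), value);
    }

    @Override
    public String get(String name) {
        return super.get(normalize(name));
    }

    @Override
    public List<String> getValues(String name) {
        return super.getValues(normalize(name));
    }
}
```

With this split, code that never sees malformed names pays no normalization cost, and only callers that handle http headers would need the subclass. A JUnit test for the real classes could assert the same kind of behaviour: a value added under "content_type" is found under "Content-Type" in the spell-checked variant but not in the plain one.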

> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: nutch-0.8-performance.txt
>
>
> There has been some discussion on the nutch mailing lists about the fetcher 
> being slow; this patch tries to address that. The patch is just a quick hack 
> and needs some cleaning up. It also currently applies to the 0.8 branch and 
> not trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original metadata uses spellchecking, the new version does not 
> (a decorator is provided that can do it, and it should perhaps be used where 
> http headers are handled, but in most cases the functionality is not 
> required)
> Reading/writing various data structures - the patch tries to do io more 
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a 
> script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
