[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Hudson (JIRA) Sat, 28 Nov 2009 06:10:09 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783239#action_12783239
 ]


Hudson commented on NUTCH-761:
------------------------------

Integrated in Nutch-trunk #995 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/])
    Fix a bug resulting from over-eager optimization in .
 Avoid cloning CrawlDatum in CrawlDbReducer.


> Avoid cloning CrawlDatum in CrawlDbReducer
> ------------------------------------------
>
>                 Key: NUTCH-761
>                 URL: https://issues.apache.org/jira/browse/NUTCH-761
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: optiCrawlReducer.patch
>
>
> In the vast majority of cases the CrawlDbReducer gets a single CrawlDatum in its
> reduce phase; these are the entries that come from the crawlDB and are not
> present in the segments.
> The attached patch optimizes the reduce step by avoiding an unnecessary cloning
> of the CrawlDatum fields when there is only one CrawlDatum in the values.
> This has more impact as the crawlDB gets larger; we noticed an improvement
> of around 25-30% in the time spent in the reduce phase.
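
The idea described in the issue can be sketched roughly as follows. This is
not the attached patch and does not use the Nutch or Hadoop APIs: Datum,
its copies counter, and reduce() are hypothetical stand-ins over a plain
java.util.Iterator, chosen only to show the single-value fast path. Whether
it is safe to hold on to a value without copying it depends on how the
framework reuses value instances, so treat this as an illustration under
that assumption.

```java
import java.util.Arrays;
import java.util.Iterator;

class SingleValueReduceSketch {

    /** Hypothetical stand-in for a mutable record such as CrawlDatum. */
    static final class Datum {
        static int copies = 0;   // counts field copies, for illustration only
        long fetchTime;

        Datum(long fetchTime) { this.fetchTime = fetchTime; }

        /** Copies the fields of other into this instance. */
        void set(Datum other) {
            this.fetchTime = other.fetchTime;
            copies++;
        }
    }

    /** Returns the datum with the latest fetchTime, or null if there are no values. */
    static Datum reduce(Iterator<Datum> values) {
        if (!values.hasNext()) {
            return null;
        }
        Datum first = values.next();
        if (!values.hasNext()) {
            // Fast path: only one value, so it is returned without copying.
            return first;
        }
        // Slow path: several values; copy fields into a private instance
        // (assumption: the caller may reuse the instances it hands out).
        Datum newest = new Datum(0L);
        newest.set(first);
        while (values.hasNext()) {
            Datum next = values.next();
            if (next.fetchTime > newest.fetchTime) {
                newest.set(next);
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Datum only = new Datum(42L);
        Datum single = reduce(Arrays.asList(only).iterator());
        System.out.println("same instance: " + (single == only)
                + ", copies: " + Datum.copies);

        Datum merged = reduce(
                Arrays.asList(new Datum(1L), new Datum(7L), new Datum(3L)).iterator());
        System.out.println("newest fetchTime: " + merged.fetchTime
                + ", copies: " + Datum.copies);
    }
}
```

In this sketch the single-value case performs no field copies at all, which
is where the saving reported in the issue would come from when most keys
carry exactly one value.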

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
