[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Andrzej Bialecki (JIRA) Wed, 25 Nov 2009 10:11:04 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrzej Bialecki  closed NUTCH-761.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Andrzej Bialecki 

> Avoid cloning CrawlDatum in CrawlDbReducer
> ------------------------------------------
>
>                 Key: NUTCH-761
>                 URL: https://issues.apache.org/jira/browse/NUTCH-761
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: optiCrawlReducer.patch
>
>
> In the vast majority of cases, the CrawlDbReducer receives a single 
> CrawlDatum per key in its reduce phase: an entry that comes from the 
> crawlDB and is not present in the segments.
> The attached patch optimizes the reduce step by avoiding an unnecessary 
> cloning of the CrawlDatum fields when there is only one CrawlDatum in the 
> values. The impact grows as the crawlDB gets larger; we noticed an 
> improvement of around 25-30% in the time spent in the reduce phase.
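
The idea described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the Nutch or Hadoop APIs: the `Datum` class, the `copies` counter, and the "keep the most recent fetch time" merge rule are all hypothetical stand-ins, and the sketch assumes that values only need to be copied when more than one of them must be retained and compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SingleValueReduceSketch {

    /** Hypothetical stand-in for a CrawlDatum-like record (not the Nutch class). */
    static final class Datum {
        final int status;
        final long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        /** Field-by-field copy; this is the work the fast path avoids. */
        Datum copy() {
            copies++;
            return new Datum(status, fetchTime);
        }
    }

    /** Counts copies so the effect of the fast path is observable. */
    static int copies = 0;

    /**
     * Reduces all values for one key to a single Datum.
     * Assumption: the merge rule (latest fetchTime wins) is a placeholder only.
     */
    static Datum reduce(Iterator<Datum> values) {
        Datum first = values.next();

        // Fast path: a single value needs no comparison, so no copy is made.
        if (!values.hasNext()) {
            return first;
        }

        // Slow path: several values must be kept side by side, so copy each one.
        List<Datum> kept = new ArrayList<>();
        kept.add(first.copy());
        while (values.hasNext()) {
            kept.add(values.next().copy());
        }

        Datum best = kept.get(0);
        for (Datum d : kept) {
            if (d.fetchTime > best.fetchTime) {
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Datum only = new Datum(1, 100L);
        Datum result = reduce(Arrays.asList(only).iterator());
        System.out.println("same instance: " + (result == only) + ", copies: " + copies);
    }
}
```

With a single value per key the input object is returned as-is and the copy counter stays at zero; the per-value copy is only paid for keys that actually have several values to merge.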

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
