[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-761.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Andrzej Bialecki

> Avoid cloning CrawlDatum in CrawlDbReducer
> ------------------------------------------
>
>                 Key: NUTCH-761
>                 URL: https://issues.apache.org/jira/browse/NUTCH-761
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Assignee: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: optiCrawlReducer.patch
>
>
> In the vast majority of cases the CrawlDbReducer receives a single CrawlDatum
> per key in its reduce phase. These are the entries that come from the crawlDB
> and are not present in the segments.
> The attached patch optimizes the reduce step by avoiding an unnecessary cloning
> of the CrawlDatum fields when there is only one CrawlDatum in the values.
> The impact increases as the crawlDB gets larger; we noticed an improvement
> of around 25-30% in the time spent in the reduce phase.
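
The idea described in the issue can be illustrated with a minimal sketch. This is not the code from optiCrawlReducer.patch: the `Datum` class is a hypothetical, simplified stand-in for CrawlDatum, the "keep the latest fetch time" rule is only a placeholder for the real merge logic, and the `copies` counter exists purely to make the saved work visible. The sketch assumes that values handed to a reducer may be reused by the framework, so a value must be copied before the iterator is advanced, and that the copy can be skipped when the iterator holds exactly one value.

```java
import java.util.Arrays;
import java.util.Iterator;

class SingleValueReduceSketch {

    /** Hypothetical, simplified stand-in for Nutch's CrawlDatum. */
    static final class Datum {
        static int copies = 0; // counts field copies, for illustration only
        final int status;
        final long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        /** Field-by-field copy: the cost the optimization tries to avoid. */
        Datum copy() {
            copies++;
            return new Datum(status, fetchTime);
        }
    }

    /**
     * Returns the datum with the latest fetch time (placeholder merge rule).
     * Copies values only when more than one value is present for the key.
     */
    static Datum reduce(Iterator<Datum> values) {
        if (!values.hasNext()) {
            return null;
        }
        Datum first = values.next();
        if (!values.hasNext()) {
            // Fast path: a single value, so nothing can overwrite it later.
            return first;
        }
        // Slow path: copy before advancing, since the instance may be reused.
        Datum best = first.copy();
        while (values.hasNext()) {
            Datum next = values.next();
            if (next.fetchTime > best.fetchTime) {
                best = next.copy();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Datum only = new Datum(1, 100L);
        reduce(Arrays.asList(only).iterator());
        System.out.println("copies after single-value key: " + Datum.copies);

        reduce(Arrays.asList(new Datum(1, 100L), new Datum(2, 200L)).iterator());
        System.out.println("copies after two-value key: " + Datum.copies);
    }
}
```

In a crawlDB update where most keys have no matching entry in the segments, nearly every call would take the fast path, which is consistent with the issue's observation that the gain grows with the size of the crawlDB.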