[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516673 ]
Andrzej Bialecki commented on NUTCH-530:
-----------------------------------------

-1 from me. See the recent discussion on Hadoop-dev - combiners simply reuse the same API as Reducer, but they follow different semantics. The contract for a combiner is that it may be run several times on the same data, so it should have no side effects on the data beyond mere aggregation of values. In our case, since we would reuse CrawlDbReducer as a combiner, we would do much more than simple aggregation. Dogacan is right that ScoringFilters would be run twice, which may produce strange results. In addition, not all values may be present when a combiner is run - combiners run in the context of the current spill, which may not include all matching values even from the same input file. Additionally, updatedb can be run with multiple segments - see the synopsis in CrawlDb.run().

> Add a combiner to improve performance on updatedb
> -------------------------------------------------
>
>                 Key: NUTCH-530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-530
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: java 1.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output of
> the map task when we try to update the crawldb based on the segment fetched.
> We can use a combiner to improve the performance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
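
The combiner contract described in the comment above can be illustrated with a minimal, self-contained sketch (plain Java, not Nutch or Hadoop code; the class and method names are hypothetical). A combiner may run zero, one, or many times on partial "spills" of the map output, so only pure, associative aggregation is safe; anything with side effects - here a stand-in for running ScoringFilters inside CrawlDbReducer - gives different results depending on how many times the framework invokes it:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the combiner contract: a combiner may be applied
// repeatedly to partial groups of values, so re-combining partial results
// must equal combining everything in one pass.
public class CombinerContract {

    // Safe combiner: pure aggregation (e.g. summing "linked" counts).
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Unsafe "combiner": applies an extra scoring adjustment on every run
    // (a stand-in for ScoringFilters). Running it on a partial spill and
    // then again on the merged result applies the adjustment twice.
    static long sumWithBoost(List<Long> values) {
        return sum(values) + 1; // side effect beyond mere aggregation
    }

    public static void main(String[] args) {
        List<Long> all = Arrays.asList(3L, 4L, 5L);

        // One pass vs. two partial passes: the pure combiner agrees.
        long onePass = sum(all);                                        // 12
        long twoPass = sum(Arrays.asList(sum(Arrays.asList(3L, 4L)), 5L)); // 12
        System.out.println(onePass == twoPass);   // prints "true"

        // The side-effecting version diverges when run on partial spills.
        long boostedOnce = sumWithBoost(all);                           // 13
        long boostedTwice = sumWithBoost(
                Arrays.asList(sumWithBoost(Arrays.asList(3L, 4L)), 5L)); // 14
        System.out.println(boostedOnce == boostedTwice); // prints "false"
    }
}
```

This is exactly why reusing CrawlDbReducer as a combiner is unsafe: its work is not limited to aggregation, so repeated invocation changes the outcome.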