[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516673
 ] 

Andrzej Bialecki  commented on NUTCH-530:
-----------------------------------------

-1 from me.

See the recent discussion on hadoop-dev: combiners simply reuse the same API 
as Reducer, but they follow different semantics. The contract for a Combiner 
is that it may be run several times (or not at all) on the same data, so it 
must not have side effects on the data beyond mere aggregation of values. In 
our case, since we would reuse CrawlDbReducer as a combiner, we would do much 
more than simple aggregation. Dogacan is right that ScoringFilters would be 
run twice, which may produce strange results. In addition, not all values may 
be present when a Combiner runs - combiners run in the context of the current 
spill, which may not include all matching values even from the same input 
file.
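To illustrate the contract (this is a hedged sketch, not Nutch or Hadoop code - the class and method names are made up for the example): a pure aggregation such as summing gives the same final result no matter how many times the framework chooses to combine partial data, while an operation with a per-invocation side effect (standing in here for a ScoringFilter being applied) does not:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerContract {
    // A safe combiner body: summing is associative, so combining the
    // already-combined outputs again yields the same final reduce result.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    // An unsafe "combiner": applies a score adjustment (a stand-in for a
    // ScoringFilter) as a side effect, once per invocation.
    static long sumWithBoost(List<Long> values) {
        return sum(values) + 1;
    }

    public static void main(String[] args) {
        // Two spills holding partial values for the same key.
        List<Long> spill1 = Arrays.asList(1L, 2L);
        List<Long> spill2 = Arrays.asList(3L);

        // Safe: combine per spill, then reduce - same as reducing everything.
        long combined = sum(Arrays.asList(sum(spill1), sum(spill2)));
        long direct = sum(Arrays.asList(1L, 2L, 3L));
        System.out.println(combined == direct); // true

        // Unsafe: the boost is applied once per run, so the final value
        // depends on how many times the framework combined the data.
        long boostedTwice = sumWithBoost(
                Arrays.asList(sumWithBoost(spill1), sumWithBoost(spill2)));
        long boostedOnce = sumWithBoost(Arrays.asList(1L, 2L, 3L));
        System.out.println(boostedTwice == boostedOnce); // false
    }
}
```

This is exactly why reusing CrawlDbReducer as a combiner is unsafe: its work is not a mere value aggregation.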

Additionally, updatedb can be run with multiple segments - see the synopsis in 
CrawlDb.run().

> Add a combiner to improve performance on updatedb
> -------------------------------------------------
>
>                 Key: NUTCH-530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-530
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: java 1.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output 
> of the map task when we try to update the crawldb based on the fetched 
> segment. We can use a combiner to improve performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
