[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998586#comment-13998586 ]
Julien Nioche commented on NUTCH-1772:
--------------------------------------

Thanks Diaa. I will have a look at it a bit later. For future reference, could you please generate the patches from the root of the Nutch SVN repo (see [https://wiki.apache.org/nutch/HowToContribute])? It makes them easier to apply and review.

> Injector does not need merging if no pre-existing crawldb
> ---------------------------------------------------------
>
>                 Key: NUTCH-1772
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1772
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.8
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1772-Logging&ErrorHandling.patch, NUTCH-1772.patch
>
>
> The injector currently works as follows:
> * MapReduce job 1 - Mapper: converts input lines into CrawlDatum objects
> with normalisation and filtering
> * MapReduce job 1 - Reducer: identity reducer. Can still have duplicates at
> this stage
> * MapReduce job 2 - Mapper: CrawlDbFilter on existing crawldb (if any) +
> output of previous job
> * MapReduce job 2 - Reducer: deduplication
> If there is no existing crawldb (which will often be the case at injection
> time), we don't really need to do the second MapReduce job and could simply
> take the output of MR job #1 as the CrawlDB, provided that we do the
> deduplication as part of the reduce step.
> If there is a crawldb, then the reduce step of MR job #1 is not really
> needed and we could run that job as map-only.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
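
The control flow proposed in the issue description can be sketched in plain Python. This is only an illustration of the idea, not Nutch or Hadoop code: `map_seed`, `dedup_reduce`, `inject` and the stage labels are hypothetical names, the filtering rule is a placeholder, and the rule that an existing crawldb entry wins over a newly injected one is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Datum = Dict[str, str]
Record = Tuple[str, Datum]


def map_seed(line: str) -> Optional[Record]:
    """Job 1 mapper: turn a seed line into a (url, datum) record.

    Placeholder for normalisation and filtering: trims whitespace and
    rejects anything that is not an http(s) URL.
    """
    url = line.strip()
    if not url.startswith(("http://", "https://")):
        return None
    return url, {"status": "injected"}


def dedup_reduce(records: Iterable[Record]) -> Dict[str, Datum]:
    """Reducer: keep one datum per URL; the first record seen wins."""
    out: Dict[str, Datum] = {}
    for url, datum in records:
        out.setdefault(url, datum)
    return out


def inject(
    seed_lines: Iterable[str],
    existing_crawldb: Optional[Dict[str, Datum]] = None,
) -> Tuple[Dict[str, Datum], List[str]]:
    """Return the new crawldb and the list of stages that were run."""
    injected = [r for r in map(map_seed, seed_lines) if r is not None]

    if existing_crawldb is None:
        # No crawldb yet: one job, with deduplication in its reduce step.
        return dedup_reduce(injected), ["job1:map", "job1:reduce-dedup"]

    # Existing crawldb: job 1 is map-only, job 2 merges and deduplicates.
    # Existing entries come first, so they win (assumption of this sketch).
    merged = list(existing_crawldb.items()) + injected
    return dedup_reduce(merged), ["job1:map", "job2:map", "job2:reduce-dedup"]
```

Calling `inject(seeds)` with no crawldb runs a single map plus deduplicating reduce, while `inject(seeds, existing)` skips the first reduce and deduplicates only once, in the merge job.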