[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000979#comment-14000979 ]
Hudson commented on NUTCH-1772:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2630 (See [https://builds.apache.org/job/Nutch-trunk/2630/])
NUTCH-1772 Injector does not need merging if no pre-existing crawldb (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1595137)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java

> Injector does not need merging if no pre-existing crawldb
> ---------------------------------------------------------
>
>                 Key: NUTCH-1772
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1772
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.8
>            Reporter: Julien Nioche
>             Fix For: 1.9
>
>         Attachments: NUTCH-1772-Logging&ErrorHandling.patch, NUTCH-1772.patch
>
>
> The injector currently works as follows:
> * MapReduce job 1 - Mapper: converts input lines into CrawlDatum objects
> with normalisation and filtering
> * MapReduce job 1 - Reducer: identity reducer. Duplicates can still be
> present at this stage
> * MapReduce job 2 - Mapper: CrawlDbFilter on the existing crawldb (if any) +
> output of the previous job
> * MapReduce job 2 - Reducer: deduplication
> If there is no existing crawldb (which will often be the case at injection
> time), we don't really need the second MapReduce job and could simply take
> the output of MR job #1 as the CrawlDB, provided that we do the
> deduplication as part of its reduce step.
> If there is a crawldb, then the reduce step of MR job #1 is not really
> needed and that job could be map-only.
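
The branching proposed in the issue can be sketched in plain Java, without any Hadoop or Nutch dependencies. This is only an illustration of the idea, not the code committed to Injector.java: the class InjectorPlanSketch, the Entry type (a stand-in for CrawlDatum) and the keep-the-first-entry deduplication rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the job layout discussed in NUTCH-1772.
// None of these names come from the Nutch code base.
public class InjectorPlanSketch {

    /** Simplified stand-in for a CrawlDatum: a URL and a score. */
    public static final class Entry {
        public final String url;
        public final float score;

        public Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Lists the steps that would run, depending on whether a crawldb already exists. */
    public static List<String> plan(boolean crawlDbExists) {
        List<String> steps = new ArrayList<>();
        steps.add("job1-map: normalise, filter, convert lines to CrawlDatum");
        if (crawlDbExists) {
            // Job 1 can be map-only; the merge job deduplicates anyway.
            steps.add("job2-map: CrawlDbFilter on existing crawldb + job 1 output");
            steps.add("job2-reduce: deduplicate");
        } else {
            // No merge needed: job 1 deduplicates and its output becomes the crawldb.
            steps.add("job1-reduce: deduplicate, output is the new crawldb");
        }
        return steps;
    }

    /** Deduplicates by URL, keeping the first entry seen (an assumed policy). */
    public static List<Entry> dedup(List<Entry> entries) {
        Map<String, Entry> byUrl = new LinkedHashMap<>();
        for (Entry e : entries) {
            byUrl.putIfAbsent(e.url, e);
        }
        return new ArrayList<>(byUrl.values());
    }

    public static void main(String[] args) {
        System.out.println("No crawldb:   " + plan(false));
        System.out.println("With crawldb: " + plan(true));
    }
}
```

In a real Hadoop job, the map-only variant would typically be obtained by setting the number of reduce tasks to zero; how the committed patch actually does this is not shown in this message.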