[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167601#comment-15167601 ]
ASF GitHub Bot commented on NUTCH-1712:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/86

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
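
For readers following the issue description, here is a rough illustration of the single-pass idea it describes: two inputs, each with its own "mapper", feeding one "reducer". It is a plain-Java simulation, not Hadoop or Nutch code, and not the patch attached to the issue. In Hadoop itself, MultipleInputs.addInputPath is the call that binds an input path to its own input format and mapper. The class name SingleJobInjectSketch, the Source enum, the http(s)-only filter, and the "existing crawldb entry wins" policy are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleJobInjectSketch {

    /** Where a record came from; a hypothetical stand-in for the state kept in a CrawlDatum. */
    enum Source { CRAWLDB, SEED }

    /** A (url, source) pair, as a mapper would emit it. */
    static final class Tagged {
        final String url;
        final Source source;
        Tagged(String url, Source source) { this.url = url; this.source = source; }
    }

    /** "Mapper" for the seed file: filter first so unwanted lines are dropped early. */
    static List<Tagged> mapSeeds(List<String> seedLines) {
        List<Tagged> out = new ArrayList<>();
        for (String line : seedLines) {
            String url = line.trim();
            // Assumed filter: skip blanks, comments and anything that is not http(s).
            if (url.isEmpty() || url.startsWith("#")) continue;
            if (!url.startsWith("http://") && !url.startsWith("https://")) continue;
            out.add(new Tagged(url, Source.SEED));
        }
        return out;
    }

    /** "Mapper" for the existing crawldb: pass records through unchanged. */
    static List<Tagged> mapCrawlDb(List<String> crawlDbUrls) {
        List<Tagged> out = new ArrayList<>();
        for (String url : crawlDbUrls) out.add(new Tagged(url, Source.CRAWLDB));
        return out;
    }

    /** "Reducer": one entry per URL; an existing crawldb entry wins (assumed policy). */
    static Map<String, Source> inject(List<String> crawlDbUrls, List<String> seedLines) {
        // Both inputs are read in the same pass, instead of a sort job followed by a merge job.
        List<Tagged> all = new ArrayList<>(mapCrawlDb(crawlDbUrls));
        all.addAll(mapSeeds(seedLines));
        Map<String, Source> merged = new TreeMap<>();
        for (Tagged t : all) {
            merged.merge(t.url, t.source,
                (a, b) -> (a == Source.CRAWLDB || b == Source.CRAWLDB) ? Source.CRAWLDB : Source.SEED);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Source> db = inject(
            Arrays.asList("http://a.example/", "http://b.example/"),
            Arrays.asList("http://b.example/", "# comment", "ftp://c.example/", "http://d.example/"));
        System.out.println(db);
    }
}
```

Tagging each record with its source is what lets a single reduce step tell newly injected URLs apart from entries already in the crawldb, which is the role the intermediate sort job played in the two-job version.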