[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167963#comment-15167963 ]

Lewis John McGibbney commented on NUTCH-1712:
---------------------------------------------

Is the Nutch codebase now acting off of Git? If so then we need to make an announcement of this somewhere.

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>             Fix For: 1.12
>
>         Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
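The single-job idea from the issue description can be illustrated without Hadoop. The following is only a plain-Java simulation under stated assumptions, not the code of the patch: the class `SingleJobInjectSketch`, the simplified `Datum` type, the `accept` filter, and the rule that an existing crawldb entry wins over a seed (with new seeds ending up as `db_unfetched`) are all hypothetical stand-ins. It shows two differently shaped inputs being tagged by their own "mapper", grouped by URL in one pass, and merged by a single "reducer", with filtering applied before any metadata would be parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of a single-pass inject; no Hadoop or Nutch classes are used. */
final class SingleJobInjectSketch {

    /** Hypothetical, simplified stand-in for a CrawlDatum: a status plus an origin flag. */
    static final class Datum {
        final String status;
        final boolean fromSeeds;

        Datum(String status, boolean fromSeeds) {
            this.status = status;
            this.fromSeeds = fromSeeds;
        }
    }

    /** Stand-in for URL filtering (assumption: only http/https URLs are accepted). */
    static boolean accept(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /** Map side for the seed input: filter first, emit a datum only for accepted URLs. */
    static void mapSeeds(List<String> seedLines, Map<String, List<Datum>> grouped) {
        for (String line : seedLines) {
            // The URL is the first tab-separated field; metadata (if any) follows it.
            String url = line.split("\t", 2)[0].trim();
            if (url.isEmpty() || !accept(url)) {
                continue; // rejected before any metadata would be parsed
            }
            grouped.computeIfAbsent(url, k -> new ArrayList<>())
                   .add(new Datum("injected", true));
        }
    }

    /** Map side for the crawldb input: existing entries pass through unchanged. */
    static void mapCrawlDb(Map<String, String> crawlDb, Map<String, List<Datum>> grouped) {
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(new Datum(e.getValue(), false));
        }
    }

    /** Reduce side: one output per URL; an existing crawldb entry wins over a seed. */
    static Map<String, String> reduce(Map<String, List<Datum>> grouped) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<Datum>> e : grouped.entrySet()) {
            String status = "db_unfetched"; // assumed status for a newly injected URL
            for (Datum d : e.getValue()) {
                if (!d.fromSeeds) {
                    status = d.status;
                    break;
                }
            }
            out.put(e.getKey(), status);
        }
        return out;
    }

    /** Both inputs feed one grouping step, mirroring the single map-reduce job. */
    static Map<String, String> inject(Map<String, String> crawlDb, List<String> seedLines) {
        Map<String, List<Datum>> grouped = new TreeMap<>();
        mapCrawlDb(crawlDb, grouped);
        mapSeeds(seedLines, grouped);
        return reduce(grouped);
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "db_fetched");
        List<String> seeds = Arrays.asList(
            "http://a.example/\tkey=value", "http://b.example/", "ftp://c.example/");
        System.out.println(inject(crawlDb, seeds));
    }
}
```

In a real job the two `map*` methods would correspond to two mapper classes registered on separate input paths, and the `TreeMap` grouping would be done by the framework's shuffle; the intermediate output of the former sort job is what disappears.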
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167935#comment-15167935 ]

ASF GitHub Bot commented on NUTCH-1712:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/86

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167628#comment-15167628 ]

ASF GitHub Bot commented on NUTCH-1712:
---------------------------------------

GitHub user sebastian-nagel reopened a pull request:

    https://github.com/apache/nutch/pull/86

    NUTCH-1712 Injector to use MultipleInputs (new MR API)

    Tested inject in combination with other CrawlDb tools (readdb, updatedb, mergedb): everything seems to work smoothly, although output files are part-0 or part-r-0 (for the old and the new MapReduce API, respectively).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebastian-nagel/nutch NUTCH-1712

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #86

commit 8900e4fb8b417f1f1e46f4dcb6c02840d2a5b838
Author: Sebastian Nagel
Date:   2015-10-19T19:48:05Z

    NUTCH-1712 applied to current trunk; run first simple tests (inject + merge)

commit 11942a92bd583eca8253e2b34f259f74c0ae4b81
Author: Sebastian Nagel
Date:   2016-01-17T20:32:31Z

    add unit tests based on MRUnit

commit 712b0b0ca2883fa399e23f7f22c9ffc236ec3db4
Author: Sebastian Nagel
Date:   2016-01-17T21:20:32Z

    update tests to reflect change of reduce outputs by new API (part-n -> part-r-n): all unit tests pass now

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167601#comment-15167601 ]

ASF GitHub Bot commented on NUTCH-1712:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/86

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117963#comment-15117963 ]

ASF GitHub Bot commented on NUTCH-1712:
---------------------------------------

GitHub user sebastian-nagel opened a pull request:

    https://github.com/apache/nutch/pull/86

    NUTCH-1712 Injector to use MultipleInputs (new MR API)

    Tested inject in combination with other CrawlDb tools (readdb, updatedb, mergedb): everything seems to work smoothly, although output files are part-0 or part-r-0 (for the old and the new MapReduce API, respectively).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebastian-nagel/nutch NUTCH-1712

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #86

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093697#comment-15093697 ]

Markus Jelsma commented on NUTCH-1712:
--------------------------------------

Nice!

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092924#comment-15092924 ]

Sebastian Nagel commented on NUTCH-1712:
----------------------------------------

The merging is done, together with minor improvements (https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), but the unit test (TestCrawlDbStates.java) still needs to be adapted.

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702028#comment-14702028 ]

Lewis John McGibbney commented on NUTCH-1712:
---------------------------------------------

[~tejasp] we are in the process of addressing NUTCH-2049. Are you interested in rebasing off of trunk so that we can work to get this patch into trunk?

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 1.11
>
>         Attachments: NUTCH-1712-trunk.v1.patch
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935823#comment-13935823 ]

Sebastian Nagel commented on NUTCH-1712:
----------------------------------------

Thanks. Looks good in general, +1 for the nicely improved command-line help.

Open points:
# The resulting CrawlDb is not readable by some tools, e.g. {{nutch readdb crawldb/ -url url}} fails. The output should be a MapFile, not a SequenceFile. Unfortunately, o.a.h.mapreduce.lib.output.MapFileOutputFormat does not seem to be available in Hadoop 1.2.0 (later versions contain the class, see MAPREDUCE-375).
# The URL normalizer scope can be changed by the new config property "crawldb.url.normalizers.scope". Do we need it? If yes, a description should be placed in nutch-default.xml.
# If this property is not set, URLNormalizers.SCOPE_CRAWLDB is used by default instead of URLNormalizers.SCOPE_INJECT. The default should still be SCOPE_INJECT, right?

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1712-trunk.v1.patch
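The open point about the default normalizer scope could be handled along these lines. This is only a sketch under assumptions: `java.util.Properties` stands in for the Hadoop configuration object, the class `NormalizerScopeSketch` and the method `resolveScope` are hypothetical, and the string values "inject" and "crawldb" are assumed to match URLNormalizers.SCOPE_INJECT and URLNormalizers.SCOPE_CRAWLDB. The point shown is that the property is optional and that the fallback is the inject scope.

```java
import java.util.Properties;

/** Sketch of resolving the normalizer scope with SCOPE_INJECT as the fallback. */
final class NormalizerScopeSketch {

    // Assumed to mirror URLNormalizers.SCOPE_INJECT and URLNormalizers.SCOPE_CRAWLDB.
    static final String SCOPE_INJECT = "inject";
    static final String SCOPE_CRAWLDB = "crawldb";

    // Property name as quoted in the review comment.
    static final String SCOPE_PROPERTY = "crawldb.url.normalizers.scope";

    /** Returns the configured scope, or the inject scope when the property is unset. */
    static String resolveScope(Properties conf) {
        return conf.getProperty(SCOPE_PROPERTY, SCOPE_INJECT);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(resolveScope(conf)); // property unset: falls back to inject

        conf.setProperty(SCOPE_PROPERTY, SCOPE_CRAWLDB);
        System.out.println(resolveScope(conf)); // explicit override is honoured
    }
}
```

Keeping the fallback at the inject scope would mean that existing setups, which never set the property, keep normalizing seed URLs the way they did before the patch.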
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288 ]

Tejas Patil commented on NUTCH-1712:
------------------------------------

The performance gains from this patch won't be phenomenal for a small seeds file without any metadata combined with a large crawldb. The only savings with this patch are the time spent on:
1. dumping the output of the first job (i.e. datum objects for the seed urls)
2. reading this output as input for the next job
3. job launch and cleanup.

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1712-trunk.v1.patch