[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167963#comment-15167963
 ] 

Lewis John McGibbney commented on NUTCH-1712:
-

Is the Nutch codebase now being developed in Git? If so, we need to 
announce this somewhere.

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Sebastian Nagel
> Fix For: 1.12
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
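
The single-job idea described in the issue can be modelled without Hadoop. The following is only a plain-Java sketch, not the code from the patch: the class SinglePassInjectSketch, the Datum type, the accept() filter and the rule that an existing crawldb entry wins over a seed are all assumptions made for illustration. It reads the existing entries and the seed lines in one pass and filters each URL before parsing its metadata.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a single-pass inject; not Nutch or Hadoop code. */
final class SinglePassInjectSketch {

    /** Hypothetical stand-in for a CrawlDatum: a status plus seed metadata. */
    static final class Datum {
        final String status;
        final Map<String, String> metadata;

        Datum(String status, Map<String, String> metadata) {
            this.status = status;
            this.metadata = metadata;
        }
    }

    /** Placeholder URL filter (assumption): accept only http and https URLs. */
    static boolean accept(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /** Parses the tab-separated key=value fields that follow the URL. */
    static Map<String, String> parseMetadata(String[] fields) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) {
                metadata.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return metadata;
    }

    /** Merges seed lines into a copy of the crawldb in one pass. */
    static Map<String, Datum> inject(Map<String, Datum> crawlDb, List<String> seedLines) {
        Map<String, Datum> merged = new TreeMap<>(crawlDb);
        for (String line : seedLines) {
            String[] fields = line.trim().split("\t");
            String url = fields[0];
            // Filter first, so rejected records never reach metadata parsing.
            if (!accept(url)) {
                continue;
            }
            // Assumed merge rule: an existing crawldb entry wins over a seed.
            if (!merged.containsKey(url)) {
                merged.put(url, new Datum("db_unfetched", parseMetadata(fields)));
            }
        }
        return merged;
    }
}
```

In a real MapReduce job the two sources would be read by separate mappers and merged per URL in the reducer; the loop above only mirrors that data flow.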



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167935#comment-15167935
 ] 

ASF GitHub Bot commented on NUTCH-1712:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/86




[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167628#comment-15167628
 ] 

ASF GitHub Bot commented on NUTCH-1712:
---

GitHub user sebastian-nagel reopened a pull request:

https://github.com/apache/nutch/pull/86

NUTCH-1712 Injector to use MultipleInputs (new MR API)

Tested inject in combination with other CrawlDb tools (readdb, updatedb, 
mergedb): everything seems to work smoothly, although output files are 
part-0 or part-r-0 (for the old and new MapReduce API, respectively).
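
As a small illustration of the naming difference mentioned above: as far as I know, Hadoop zero-pads the partition number to five digits, so the files usually show up as part-00000 (old API) and part-r-00000 (new API). The helper class below is hypothetical and not part of Nutch or Hadoop; it only formats names the way a test might expect them.

```java
import java.util.Locale;

/** Hypothetical helper that formats reducer output file names. */
final class PartFileNames {

    /** Name used by the old (org.apache.hadoop.mapred) API. */
    static String oldApi(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    /** Name used for reducer output by the new (org.apache.hadoop.mapreduce) API. */
    static String newApi(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }
}
```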

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sebastian-nagel/nutch NUTCH-1712

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #86


commit 8900e4fb8b417f1f1e46f4dcb6c02840d2a5b838
Author: Sebastian Nagel 
Date:   2015-10-19T19:48:05Z

NUTCH-1712 applied to current trunk; run first simple tests (inject + merge)

commit 11942a92bd583eca8253e2b34f259f74c0ae4b81
Author: Sebastian Nagel 
Date:   2016-01-17T20:32:31Z

add unit tests based on MRUnit

commit 712b0b0ca2883fa399e23f7f22c9ffc236ec3db4
Author: Sebastian Nagel 
Date:   2016-01-17T21:20:32Z

update tests to reflect change of reduce outputs by new API (part-n -> 
part-r-n): all unit tests pass now






[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167601#comment-15167601
 ] 

ASF GitHub Bot commented on NUTCH-1712:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/86




[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117963#comment-15117963
 ] 

ASF GitHub Bot commented on NUTCH-1712:
---

GitHub user sebastian-nagel opened a pull request:

https://github.com/apache/nutch/pull/86

NUTCH-1712 Injector to use MultipleInputs (new MR API)

Tested inject in combination with other CrawlDb tools (readdb, updatedb, 
mergedb): everything seems to work smoothly, although output files are 
part-0 or part-r-0 (for the old and new MapReduce API, respectively).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sebastian-nagel/nutch NUTCH-1712

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #86








[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093697#comment-15093697
 ] 

Markus Jelsma commented on NUTCH-1712:
--

Nice!



[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092924#comment-15092924
 ] 

Sebastian Nagel commented on NUTCH-1712:


The merging is done, together with minor improvements 
(https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), 
but the unit test (TestCrawlDbStates.java) still needs to be adapted.




[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2015-08-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702028#comment-14702028
 ] 

Lewis John McGibbney commented on NUTCH-1712:
-

[~tejasp] we are in the process of addressing NUTCH-2049. Would you be interested in 
rebasing against trunk so that we can work to get this patch into trunk? 

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Fix For: 1.11
>
> Attachments: NUTCH-1712-trunk.v1.patch


[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-03-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935823#comment-13935823
 ] 

Sebastian Nagel commented on NUTCH-1712:


Thanks. Looks good in general, +1 for the nicely improved command-line help. 
Open points:
# The resulting CrawlDb is not readable by some tools, e.g. {{nutch readdb 
crawldb/ -url url}} fails. The output should be a MapFile, not a SequenceFile. 
Unfortunately, o.a.h.mapreduce.lib.output.MapFileOutputFormat does not seem to be 
available in Hadoop 1.2.0 (later versions contain the class, see MAPREDUCE-375).
# The URL normalizer scope can be changed by the new config property 
"crawldb.url.normalizers.scope". Do we need it? If yes, a description should be 
placed into nutch-default.xml.
# If this property is not set, URLNormalizers.SCOPE_CRAWLDB is used by default 
instead of URLNormalizers.SCOPE_INJECT. The default should still be SCOPE_INJECT, 
right?
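
The default asked for in the last point could look roughly like the sketch below. It is an illustration only: java.util.Properties stands in for the Hadoop Configuration, the class NormalizerScopeSketch is hypothetical, and the string values "inject" and "crawldb" are assumed to match URLNormalizers.SCOPE_INJECT and URLNormalizers.SCOPE_CRAWLDB. Only the property name is taken from the patch.

```java
import java.util.Properties;

/** Hypothetical model of resolving the URL normalizer scope for inject. */
final class NormalizerScopeSketch {

    // Assumed to mirror URLNormalizers.SCOPE_INJECT and SCOPE_CRAWLDB.
    static final String SCOPE_INJECT = "inject";
    static final String SCOPE_CRAWLDB = "crawldb";

    // Property name as introduced by the patch under review.
    static final String SCOPE_PROPERTY = "crawldb.url.normalizers.scope";

    /** Returns the configured scope, falling back to the inject scope. */
    static String resolveScope(Properties conf) {
        return conf.getProperty(SCOPE_PROPERTY, SCOPE_INJECT);
    }
}
```

With this fallback an unset property keeps the previous inject behaviour, while a site that sets the property explicitly can still switch scopes.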


> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1712-trunk.v1.patch



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288
 ] 

Tejas Patil commented on NUTCH-1712:


The performance gains from this patch won't be phenomenal for a small seeds 
file without any metadata and a large crawldb. The only savings with this patch 
are the time spent on:
1. dumping the output of the first job (i.e. datum objects for the seed urls)
2. reading this output as input for the next job
3. job launch and cleanup.

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)