[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602110#comment-14602110
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/37

NUTCH-2038

Made changes suggested by Sebastian Nagel.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra 
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra 
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra 
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra 
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038




> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> ---
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, injector, parser
>Reporter: Asitang Mishra
>Assignee: Chris A. Mattmann
>  Labels: memex, nutch
> Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. First, classify 
> the parse text to decide whether the parent page is relevant. If it is 
> relevant, keep all of its outlinks. If it is irrelevant, go through each 
> outlink and check whether the URL contains any of the important words from 
> a list; if it does, let the outlink pass.
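
The two stages described above can be sketched roughly as follows. This is only an illustration, not the code from the pull request: the class name, the method signature and the Predicate standing in for the trained Naive Bayes model are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative sketch only; names and signatures are hypothetical. */
final class TwoStageOutlinkFilterSketch {

  /**
   * @param parseText      text extracted from the parent page
   * @param outlinks       URLs found on the parent page
   * @param isRelevant     stand-in for the trained classifier (assumption)
   * @param importantWords lowercase words that let a URL pass on an irrelevant page
   */
  static List<String> filter(String parseText, List<String> outlinks,
      Predicate<String> isRelevant, Set<String> importantWords) {
    // Stage 1: a relevant parent page keeps all of its outlinks.
    if (isRelevant.test(parseText)) {
      return new ArrayList<>(outlinks);
    }
    // Stage 2: on an irrelevant page keep only URLs containing an important word.
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String lower = url.toLowerCase(Locale.ROOT);
      for (String word : importantWords) {
        if (lower.contains(word)) {
          kept.add(url);
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://a.example/rocket-launch", "http://a.example/recipes");
    Set<String> words = new HashSet<>(Arrays.asList("rocket"));
    // The lambda pretends the classifier judged the page irrelevant.
    System.out.println(filter("some page text", links, text -> false, words));
  }
}
```

Passing the classifier in as a Predicate keeps the sketch independent of any particular model implementation.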



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602108#comment-14602108
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/36




GSOC midterm report

2015-06-25 Thread Cihad Guzel
Hi Lewis and Talat.

I have prepared the midterm report for GSoC 2015 and attached it to this mail
as a PDF file.

Thanks.


[jira] [Commented] (NUTCH-2041) indexer fails if linkdb is missing

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601852#comment-14601852
 ] 

Hudson commented on NUTCH-2041:
---

SUCCESS: Integrated in Nutch-trunk #3176 (See 
[https://builds.apache.org/job/Nutch-trunk/3176/])
NUTCH-2041 indexer fails if linkdb is missing (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1687612)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java


> indexer fails if linkdb is missing
> --
>
> Key: NUTCH-2041
> URL: https://issues.apache.org/jira/browse/NUTCH-2041
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, linkdb
>Affects Versions: 1.10
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2014-v1.patch
>
>
> If the linkdb is missing the indexer fails with
> {noformat}
> 2015-06-17 12:52:10,621 ERROR 
> ...cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
> exist: .../linkdb/current
> {noformat}
> If both db.ignore.internal.links and db.ignore.external.links are true, there 
> will be no LinkDb even if "invertlinks" is run (as a consequence of 
> NUTCH-1913). The script "bin/crawl" does not know about the values of these 
> two properties and calls the indexer with "-linkdb .../linkdb", which will 
> then fail.
> Since "bin/crawl" is agnostic to properties defined in nutch-site.xml, we 
> need a solution similar to NUTCH-1854: make the tool/job more tolerant and 
> log a warning instead of raising an error.
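
The "more tolerant" behaviour proposed above could look roughly like the sketch below. It is not the committed change to IndexerMapReduce: it uses java.nio on a local file system instead of the Hadoop FileSystem API, and the class name, method name and log message are assumptions made for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Illustrative sketch only; not the actual Nutch indexer code. */
final class OptionalLinkDbSketch {
  private static final Logger LOG =
      Logger.getLogger(OptionalLinkDbSketch.class.getName());

  /** Collects the job input directories, skipping a missing linkdb with a warning. */
  static List<Path> inputDirs(List<Path> segments, Path crawlDb, Path linkDb) {
    List<Path> inputs = new ArrayList<>(segments);
    inputs.add(crawlDb.resolve("current"));
    if (linkDb != null) {
      Path current = linkDb.resolve("current");
      if (Files.isDirectory(current)) {
        inputs.add(current);
      } else {
        // Tolerate the missing directory instead of failing the whole job.
        LOG.warning("No linkdb found at " + current + ", indexing without it");
      }
    }
    return inputs;
  }

  public static void main(String[] args) {
    List<Path> inputs = inputDirs(new ArrayList<Path>(),
        Paths.get("crawl", "crawldb"), Paths.get("crawl", "linkdb"));
    System.out.println(inputs);
  }
}
```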





[jira] [Commented] (NUTCH-2016) Remove unused class OldFetcher

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601853#comment-14601853
 ] 

Hudson commented on NUTCH-2016:
---

SUCCESS: Integrated in Nutch-trunk #3176 (See 
[https://builds.apache.org/job/Nutch-trunk/3176/])
NUTCH-2016 Remove unused class OldFetcher (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1687608)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/OldFetcher.java


> Remove unused class OldFetcher
> --
>
> Key: NUTCH-2016
> URL: https://issues.apache.org/jira/browse/NUTCH-2016
> Project: Nutch
>  Issue Type: Wish
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.11
>
>
> The class OldFetcher is not actively maintained and lacks all features added 
> to the new threaded Fetcher (started in 2007, used as default fetcher since 
> 2009). Time to remove it from the code base (trunk/1.x only)?





[jira] [Commented] (NUTCH-1730) Scoring-depth optionally not to increment depth for external hosts

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601847#comment-14601847
 ] 

Sebastian Nagel commented on NUTCH-1730:


* there is a typo in the property name ("exteral")
{code}
ignoreExternal = conf.getBoolean("scoring.depth.ignore.exteral", false);
{code}
* the property should be described in nutch-default.xml
* what's the aim of
{code}
-if (curDepth >= curMaxDepth) {
+if (curMaxDepth > 0 && curDepth >= curMaxDepth) {
{code}
A curMaxDepth of 0 (or -1) would mean: accept everything up to an unlimited 
linkage depth. Right? Since -1 is used quite often with the meaning 
"unlimited", that's a good idea. But we should make this explicit (add it to 
the Javadoc and nutch-default.xml), and use it also for DEFAULT_MAX_DEPTH.

Just curious what's the use case? Doesn't the depth easily get out of control? 
E.g., if a seed document links to an external page which links back to a page 
deep on the first site, the deep page becomes equivalent to the seed doc.
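
To make the semantics under discussion concrete, here is a small sketch. The variable names curDepth, curMaxDepth and ignoreExternal come from the snippets quoted above; the class, the two helper methods and the exact handling of external links are assumptions, not the patch itself.

```java
/** Illustrative sketch only; not the scoring-depth plugin code. */
final class DepthLimitSketch {

  /** A non-positive maximum (0 or -1) is read as "unlimited". */
  static boolean exceedsMaxDepth(int curDepth, int curMaxDepth) {
    return curMaxDepth > 0 && curDepth >= curMaxDepth;
  }

  /** Assumed behaviour: links to external hosts do not add depth when ignoreExternal is set. */
  static int outlinkDepth(int parentDepth, boolean externalLink, boolean ignoreExternal) {
    return (externalLink && ignoreExternal) ? parentDepth : parentDepth + 1;
  }

  public static void main(String[] args) {
    System.out.println(exceedsMaxDepth(1000, -1));   // unlimited depth
    System.out.println(exceedsMaxDepth(5, 5));       // limit reached
    System.out.println(outlinkDepth(3, true, true)); // external hop adds no depth
  }
}
```

Under these assumptions a page reached through an external hop inherits the depth of the linking page, which is exactly the case the question above is worried about.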

> Scoring-depth optionally not to increment depth for external hosts
> --
>
> Key: NUTCH-1730
> URL: https://issues.apache.org/jira/browse/NUTCH-1730
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1730-trunk.patch, NUTCH-1730.patch
>
>
> Currently, the plugin always increments depth, even when coming or going to 
> external hosts.





[jira] [Commented] (NUTCH-1684) ParseMeta to be added before fetch schedulers are run

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601786#comment-14601786
 ] 

Sebastian Nagel commented on NUTCH-1684:


+1

> ParseMeta to be added before fetch schedulers are run
> -
>
> Key: NUTCH-1684
> URL: https://issues.apache.org/jira/browse/NUTCH-1684
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1684-trunk.patch, NUTCH-1684-trunk.patch
>
>
> FetchSchedulers cannot operate on parseMeta in the CrawlDatum because it is 
> added after the schedulers have run.





[jira] [Assigned] (NUTCH-2041) indexer fails if linkdb is missing

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2041:
--

Assignee: Sebastian Nagel



[jira] [Commented] (NUTCH-1335) OutlinkDB to collect unique URL's only

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601777#comment-14601777
 ] 

Sebastian Nagel commented on NUTCH-1335:


Reasonable. But wouldn't it be more consistent to take only one (the first, the 
last, the most recent)? In the worst case, if links are just sorted from old to 
new, all of them are still taken.

> OutlinkDB to collect unique URL's only
> --
>
> Key: NUTCH-1335
> URL: https://issues.apache.org/jira/browse/NUTCH-1335
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1335-1.6-1.patch, NUTCH-1335.patch
>
>
> The aggregating code in the Outlink reducer does not take care of incoming 
> duplicates. When the input segments contain duplicates of a single URL they 
> are collected.
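
One simple way to collect each URL only once is sketched below. It keeps the first occurrence and is not taken from either attached patch; the class and method names are hypothetical, and plain strings stand in for whatever link objects the reducer actually handles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the Outlink reducer code. */
final class UniqueOutlinksSketch {

  /** Keeps the first occurrence of each URL and preserves encounter order. */
  static List<String> unique(Iterable<String> urls) {
    Set<String> seen = new LinkedHashSet<>();
    for (String url : urls) {
      seen.add(url); // a repeated URL is ignored by the set
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    System.out.println(unique(Arrays.asList(
        "http://a.example/", "http://b.example/", "http://a.example/")));
  }
}
```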





[jira] [Resolved] (NUTCH-2041) indexer fails if linkdb is missing

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2041.

Resolution: Fixed

Committed to trunk, r1687612.



[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601734#comment-14601734
 ] 

Hudson commented on NUTCH-2000:
---

SUCCESS: Integrated in Nutch-trunk #3175 (See 
[https://builds.apache.org/job/Nutch-trunk/3175/])
NUTCH-2000 Link inversion fails with .locked already exists (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1687604)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java


> Link inversion fails with .locked already exists.
> -
>
> Key: NUTCH-2000
> URL: https://issues.apache.org/jira/browse/NUTCH-2000
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Assignee: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2000-v1.patch
>
>
> using standard crawl script with a brand new test dir in local mode I am 
> getting 
> Link inversion
> /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
> /data/BLABLABLA/testCrawl2//linkdb 
> /data/BLABLABLA/testCrawl2//segments/20150423114335
> LinkDb: java.io.IOException: lock file 
> /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
> PS: 2000!





[jira] [Assigned] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2000:
--

Assignee: Sebastian Nagel



[jira] [Resolved] (NUTCH-2016) Remove unused class OldFetcher

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2016.

Resolution: Fixed
  Assignee: Sebastian Nagel

Committed to trunk, r1687608. Thanks!



[jira] [Updated] (NUTCH-2016) Remove unused class OldFetcher

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2016:
---
Summary: Remove unused class OldFetcher  (was: Remove OldFetcher from trunk)



[jira] [Resolved] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2000.

Resolution: Fixed

Committed to trunk, r1687604. Thanks!



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-06-25 Thread Ji Kwon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601640#comment-14601640
 ] 

Ji Kwon Lim commented on NUTCH-1517:


Hi,

We are attempting to use Nutch with CloudSearch, and we are using the patch 
provided in this ticket. However, we noticed that the patch seems to be 
incomplete: it requires a manual change to 
org.apache.nutch.parse.MetaTagsParser.java to replace all references to 
'metadata.add("metatag."' with 'metadata.add("metatag_"', swapping the 
period for an underscore. Is there a newer patch that addresses this issue, 
or a newer process altogether for getting Nutch to work with CloudSearch? If 
not, could we get an update to the patch to include the change to 
org.apache.nutch.parse.MetaTagsParser.java that's necessary for the indexer to 
work properly?


Regards,

Ji Kwon Lim
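
For illustration, the manual edit described above amounts to replacing the period in the metadata key with an underscore. The helper below is purely hypothetical (it is not part of the attached patches or of MetaTagsParser) and only shows the renaming itself.

```java
/** Illustrative sketch only; class and method are hypothetical. */
final class MetatagFieldNameSketch {

  /** Replaces every period in a metadata key with an underscore. */
  static String withoutPeriods(String key) {
    return key.replace('.', '_');
  }

  public static void main(String[] args) {
    System.out.println(withoutPeriods("metatag.description"));
  }
}
```

Doing the renaming in one place on the indexing side, rather than editing the parser, would be one possible design; whether that fits the plugin is an open question.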

> CloudSearch indexer
> ---
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
> 0025666929_1382393138_indexer-cloudsearch.20131021.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon 
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
> JSON based representation Search Data Format (SDF), which we could reuse for 
> a file based indexer.





[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601602#comment-14601602
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2036:
---

Thanks! I was waiting for someone else to review, but I'm glad to see it committed.

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin, tool, util
>Affects Versions: 1.10
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036-v2.patch, NUTCH-2036.patch
>
>
> Nutch does not support continuous crawling out of the box. Yes, this is 
> doable using cron, and it is sometimes irrelevant due to the size of the 
> crawl, but it's a nice feature to have. 
> This patch basically just adds a new parameter option to the {{bin/crawl}} 
> script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
> no URLs are scheduled for fetching). 
> This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is 
> provided, the amount of time is assumed to be in seconds. Valid suffixes 
> are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If a {{-1}} value is passed to the parameter or it's not used at all, the 
> default behaviour of exiting the script is used.
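
The NUMBER[SUFFIX] format described above can be illustrated with a small parser. The real option is handled inside the bin/crawl shell script; the Java class below is only a sketch of the format, and its name, method and error handling are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; the real logic lives in the bin/crawl shell script. */
final class WaitOptionSketch {
  private static final Pattern FORMAT = Pattern.compile("(\\d+)([smhd]?)");

  /** Converts NUMBER[SUFFIX] to seconds; -1 means "exit instead of waiting". */
  static long toSeconds(String value) {
    if (value == null || value.equals("-1")) {
      return -1; // option unused or explicitly disabled
    }
    Matcher m = FORMAT.matcher(value);
    if (!m.matches()) {
      throw new IllegalArgumentException("Expected NUMBER[SUFFIX], got: " + value);
    }
    long number = Long.parseLong(m.group(1));
    switch (m.group(2)) {
      case "m": return number * 60;
      case "h": return number * 3600;
      case "d": return number * 86400;
      default:  return number; // "s" or no suffix: seconds
    }
  }

  public static void main(String[] args) {
    System.out.println(toSeconds("90"));
    System.out.println(toSeconds("2h"));
  }
}
```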





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601499#comment-14601499
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

OK, so it looks like the latest patch is committable. Asitang is addressing 
Seb's comments. I will go ahead and get this committed later today after I see 
the final PR/patch. Great work.



[jira] [Reopened] (NUTCH-1416) IndexerMapReduce can index older version of a document instead of latest one

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-1416:

  Assignee: Sebastian Nagel

Re-opening: if you have to re-index a bunch of segments, this cannot be done 
within a single job:
* either run SegmentMerger before
* or do it segment by segment in the right order with some overhead (reading 
CrawlDb and LinkDb again and again)

It should not be too hard to fix. We have to re-establish the correct ordering 
by segment name or fetch time for the 3 items coming from segments:
# fetch datum: can be sorted by fetch time, see NUTCH-1617
# ParseData contains the segment name in content metadata: use this to keep 
only the latest one
# ParseText: need a way to associate it with the segment it stems from, e.g., 
wrap it into a MetaWrapper object as SegmentMerger does

There should not be too much overhead for the default (only one segment is 
indexed): it's only wrapping ParseText and a few null-checks whether one of the 
items will be overwritten. Possibly, we can even optimize the one-segment 
indexing.
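
The "keep only the latest one" step could be sketched as below. This is not Nutch code: the Tagged wrapper is a hypothetical stand-in for a value associated with its segment, and the comparison assumes segment names are timestamps of equal length (such as 20150423114335), so that string order matches chronological order.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; Tagged is a hypothetical stand-in, not a Nutch class. */
final class LatestSegmentSketch {

  /** Pairs a value with the name of the segment it was read from. */
  static final class Tagged<T> {
    final String segmentName;
    final T value;

    Tagged(String segmentName, T value) {
      this.segmentName = segmentName;
      this.value = value;
    }
  }

  /** Returns the entry from the newest segment, or null if there are no entries. */
  static <T> Tagged<T> latest(Iterable<Tagged<T>> entries) {
    Tagged<T> newest = null;
    for (Tagged<T> entry : entries) {
      // Timestamp-style names of equal length sort chronologically as strings.
      if (newest == null || entry.segmentName.compareTo(newest.segmentName) > 0) {
        newest = entry;
      }
    }
    return newest;
  }

  public static void main(String[] args) {
    List<Tagged<String>> entries = Arrays.asList(
        new Tagged<String>("20150525101500", "new parse of page A"),
        new Tagged<String>("20150423114335", "old parse of page A"));
    System.out.println(latest(entries).value);
  }
}
```

Because the newest entry is chosen by comparison rather than by arrival order, the result does not depend on the order in which the reducer receives the values.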

> IndexerMapReduce can index older version of a document instead of latest one
> 
>
> Key: NUTCH-1416
> URL: https://issues.apache.org/jira/browse/NUTCH-1416
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Reporter: Jianyun He
>Assignee: Sebastian Nagel
>Priority: Critical
>
> When we update the index, there is no guarantee that the content being 
> indexed is the latest. The method reduce() of the class IndexerMapReduce has 
> the following code:
> public void reduce(Text key, Iterator values,
>  OutputCollector output, Reporter 
> reporter) throws IOException {
>……
>} else if (value instanceof ParseData) {  
>   parseData = (ParseData)value;
>} else if (value instanceof ParseText) { 
>   parseText = (ParseText)value;
>}
>……
> }
> For example, 30 days ago I fetched the web page A, and now I fetch it again. 
> The key A then corresponds to two ParseData objects (located in different 
> segments). But this code does not compare the fetch times and simply 
> overwrites the previous value, so the final value may be the old one.





[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601314#comment-14601314
 ] 

Sebastian Nagel commented on NUTCH-1625:


Have the segments been indexed within one single indexing job? There is still 
NUTCH-1416, which makes it unsafe to index multiple segments with 
overlapping content because no ordering of items (CrawlDatums, ParseData, etc.) 
with the same key/URL is guaranteed. Some classes we cannot keep in order 
without wrapping them with the segment name (as done by SegmentMerger); for 
others we could re-establish the order, see NUTCH-1617. I'll re-open 
NUTCH-1416: it should not be too hard to fix, and it's really annoying if there 
is no simple way to re-index a bunch of segments. If we get it fixed, your 
patch is a necessary part. 

> IndexerMapReduce skips FETCH_NOTMODIFIED
> 
>
> Key: NUTCH-1625
> URL: https://issues.apache.org/jira/browse/NUTCH-1625
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Critical
> Fix For: 1.11
>
> Attachments: NUTCH-1625.patch, NUTCH-1625.patch
>
>
> IndexerMapReduce has the option to skip DB_NOTMODIFIED but legacy code also 
> skips FETCH_NOTMODIFIED and the latter is not optional. We can keep the check 
> but that should also include FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED 
> isn't very useful anyway because since 1.5 or so we can safely rely on 
> DB_NOTMODIFIED as it is properly set in the CrawlDBReducer.





[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601291#comment-14601291
 ] 

Hudson commented on NUTCH-2036:
---

SUCCESS: Integrated in Nutch-trunk #3174 (See 
[https://builds.apache.org/job/Nutch-trunk/3174/])
Adding some continuous crawl goodies to the crawl script NUTCH-2036 (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1687522)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/bin/crawl



[jira] [Resolved] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2036.
--
Resolution: Fixed

Committed revision 1687522.

Thanks!


[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2036:
---
Attachment: NUTCH-2036-v2.patch

+1, the attached patch only improves the logging

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036-v2.patch, NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Comment Edited] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601083#comment-14601083
 ] 

Sebastian Nagel edited comment on NUTCH-2036 at 6/25/15 12:00 PM:
--

+1, attached a new patch which only improves the logging


was (Author: wastl-nagel):
+1, attached patch does only improve the logging

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036-v2.patch, NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2015-06-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601006#comment-14601006
 ] 

Markus Jelsma commented on NUTCH-1625:
--

Hello Sebastian, we used this in Nutch 1.6 when we reindexed many old segments 
containing duplicates. At the time we had trouble reindexing those segments: 
some entries didn't make it into the index. We fixed that issue with that patch. 
I looked at the current code again, and I think the problem is still there in 
the case of many segments containing duplicates.

> IndexerMapReduce skips FETCH_NOTMODIFIED
> 
>
> Key: NUTCH-1625
> URL: https://issues.apache.org/jira/browse/NUTCH-1625
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Critical
> Fix For: 1.11
>
> Attachments: NUTCH-1625.patch, NUTCH-1625.patch
>
>
> IndexerMapReduce has the option to skip DB_NOTMODIFIED, but legacy code also 
> skips FETCH_NOTMODIFIED, and the latter is not optional. We can keep the check 
> but that should also include FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED 
> isn't very useful anyway, because since 1.5 or so we can safely rely on 
> DB_NOTMODIFIED, as it is properly set in the CrawlDBReducer.
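
One possible reading of the proposal above is that a single option should govern both skips. The sketch below illustrates that reading only; it is not the IndexerMapReduce code. The status constants are placeholder strings rather than Nutch's CrawlDatum values, and should_skip and skip_notmodified are names invented for this example.

```python
# Placeholder status values for illustration; not Nutch's real constants.
DB_NOTMODIFIED = "db_notmodified"
FETCH_NOTMODIFIED = "fetch_notmodified"


def should_skip(db_status, fetch_status, skip_notmodified):
    """Decide whether a document is left out of indexing.

    Both "not modified" statuses are skipped only when the (hypothetical)
    skip_notmodified option is on, so neither skip is unconditional.
    """
    if not skip_notmodified:
        return False
    return db_status == DB_NOTMODIFIED or fetch_status == FETCH_NOTMODIFIED
```

With the option off, a document whose fetch status is "not modified" is still indexed, which is the case the issue says the legacy code gets wrong.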





[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600989#comment-14600989
 ] 

Markus Jelsma commented on NUTCH-2036:
--

Seems fine to me :)

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600949#comment-14600949
 ] 

Julien Nioche commented on NUTCH-2036:
--

Any thoughts on this? This is useful and should be committed I think.

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Fix Version/s: 1.11

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Affects Version/s: (was: 1.11)

> Adding some continuous crawl goodies to the crawl script
> 
>
> Key: NUTCH-2036
> URL: https://issues.apache.org/jira/browse/NUTCH-2036
> Project: Nutch
> Issue Type: Improvement
> Components: bin, tool, util
> Affects Versions: 1.10
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: crawl, script
> Fix For: 1.11
>
> Attachments: NUTCH-2036.patch
>
>
> Although Nutch does not support continuous crawling out of the box (and yes, 
> this is doable using cron, and is sometimes irrelevant due to the size of the 
> crawl), it's a nice feature to have. 
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) which 
> sets a time to wait if the generator returns 0 (when no URLs are scheduled 
> for fetching). 
> The new option has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, 
> the amount of time is assumed to be in seconds. Valid suffixes are: 
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the option, or it's not used at all, the default 
> behaviour of exiting the script is used.





[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk

2015-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600946#comment-14600946
 ] 

Julien Nioche commented on NUTCH-2016:
--

+1

> Remove OldFetcher from trunk
> 
>
> Key: NUTCH-2016
> URL: https://issues.apache.org/jira/browse/NUTCH-2016
> Project: Nutch
> Issue Type: Wish
> Components: fetcher
> Affects Versions: 1.11
> Reporter: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.11
>
>
> The class OldFetcher is not actively maintained and lacks all features added 
> to the new threaded Fetcher (started in 2007, used as default fetcher since 
> 2009). Time to remove it from the code base (trunk/1.x only)?





[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk

2015-06-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600901#comment-14600901
 ] 

Markus Jelsma commented on NUTCH-2016:
--

+1

> Remove OldFetcher from trunk
> 
>
> Key: NUTCH-2016
> URL: https://issues.apache.org/jira/browse/NUTCH-2016
> Project: Nutch
> Issue Type: Wish
> Components: fetcher
> Affects Versions: 1.11
> Reporter: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.11
>
>
> The class OldFetcher is not actively maintained and lacks all features added 
> to the new threaded Fetcher (started in 2007, used as default fetcher since 
> 2009). Time to remove it from the code base (trunk/1.x only)?





[jira] [Commented] (NUTCH-2041) indexer fails if linkdb is missing

2015-06-25 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600900#comment-14600900
 ] 

Markus Jelsma commented on NUTCH-2041:
--

+1

> indexer fails if linkdb is missing
> --
>
> Key: NUTCH-2041
> URL: https://issues.apache.org/jira/browse/NUTCH-2041
> Project: Nutch
> Issue Type: Bug
> Components: indexer, linkdb
> Affects Versions: 1.10
> Reporter: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2014-v1.patch
>
>
> If the linkdb is missing the indexer fails with
> {noformat}
> 2015-06-17 12:52:10,621 ERROR 
> ...cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
> exist: .../linkdb/current
> {noformat}
> If both db.ignore.internal.links and db.ignore.external.links are true, there 
> will be no LinkDb even if "invertlinks" is run (as a consequence of NUTCH-1913). 
> The script "bin/crawl" does not know about the values of these two properties 
> and calls the indexer with "-linkdb .../linkdb", which will then fail.
> Since "bin/crawl" is agnostic to properties defined in nutch-site.xml, we need 
> a solution similar to NUTCH-1854: make the tool/job more tolerant and log a 
> warning instead of raising an error.
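
The "more tolerant" behaviour proposed above could look roughly like the following. This is a minimal sketch on a local filesystem, not the Nutch indexer code (which is Java and works on Hadoop paths); resolve_linkdb and the logger name are hypothetical.

```python
import logging
import os

log = logging.getLogger("indexer_sketch")


def resolve_linkdb(linkdb_dir):
    """Return the LinkDb input path, or None if there is no LinkDb to read.

    A missing <linkdb>/current directory is logged as a warning instead of
    being treated as a fatal error, so indexing can continue without it.
    """
    if linkdb_dir is None:
        return None
    current = os.path.join(linkdb_dir, "current")
    if not os.path.isdir(current):
        log.warning("LinkDb %s does not exist, indexing without it", current)
        return None
    return current
```

A caller would add the returned path to the job inputs only when it is not None.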





RE: [IMPORTANT] Migration Towards Hadoop 2.X --> 3.X

2015-06-25 Thread Markus Jelsma
Hello Lewis, trunk runs fine on a Hadoop 2.6 cluster. We have not seen any 
issues. I made some attempts at first to port some jobs to the new mapreduce 
API, but there were problems and some APIs didn't exist in the new mapreduce 
API. As far as I know the old mapred API is not going to be deprecated for a 
long while to come.

Re: Nutch 3.0, I see no point in that so far.

Markus

-Original message-
From: Lewis John Mcgibbney
Sent: Thursday 25th June 2015 1:19
To: dev@nutch.apache.org
Subject: [IMPORTANT] Migration Towards Hadoop 2.X --> 3.X

Hi Folks,

Before too long, Hadoop will be at 3.X for stable official releases.

I wanted to solicit the dev@ community to see what difficulties, if any, people 
have had running Nutch trunk on Hadoop 2.X.

Hadoop 2.X is supported on Nutch 2.X but getting the patches all correct is 
literally a PITA... we are working on that down in the Gora community and need 
to get a better, more frequent release cycle.

I just wanted to know if there was motivation for us to get some patches 
committed to trunk, release it as 1.11, then focus the next development drive 
on a switch to Hadoop 2.X for trunk.

We could potentially then release Nutch > 1.11 as 3.0.

What do you guys think?

Thanks

Lewis
