[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155635#comment-17155635
 ] 

Lewis John McGibbney commented on NUTCH-2802:
---------------------------------------------

[~snagel] thanks for opening this one. I'll go ahead and create a PR shortly. 

> Replace blacklist/whitelist by more inclusive and precise terminology
> --------------------------------------------------------------------
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, plugin
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.18
>
>
> The terms blacklist and whitelist should be replaced by more inclusive and 
> more precise terminology, see the proposal and discussion on the @dev mailing 
> list 
> ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E],
>  
> [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and 
> configuration properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2802:
---------------------------------------------

Assignee: Lewis John McGibbney

> Replace blacklist/whitelist by more inclusive and precise terminology
> --------------------------------------------------------------------
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, plugin
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.18
>
>
> The terms blacklist and whitelist should be replaced by more inclusive and 
> more precise terminology, see the proposal and discussion on the @dev mailing 
> list 
> ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E],
>  
> [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and 
> configuration properties.





[PROPOSAL] Replace whitelist blacklist with allowlist denylist

2020-06-09 Thread lewis john mcgibbney
Hi Folks,

*What*
I would like to propose that we replace 'whiteList'- and 'blackList'-style
terms/phrases in the source code with more representative language, e.g.
allowList, denyList.

*Where*
* subcollection plugin -
https://github.com/apache/nutch/blob/master/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java#L46-L47
* urlfilter-domainblacklist plugin -
https://github.com/apache/nutch/tree/master/src/plugin/urlfilter-domainblacklist
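To make the proposal concrete, here is a minimal sketch of what the renamed fields could look like in a subcollection-style class. The class name, field names, and the simplified prefix-matching logic are illustrative only, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed renaming: the subcollection plugin
// keeps two URL-prefix lists; 'whiteList'/'blackList' would become
// 'allowList'/'denyList' with unchanged matching semantics.
public class SubcollectionSketch {

  // Formerly 'whiteList': URL prefixes that include a page in the collection.
  private final List<String> allowList = new ArrayList<>();

  // Formerly 'blackList': URL prefixes that exclude a page.
  private final List<String> denyList = new ArrayList<>();

  public void allow(String prefix) { allowList.add(prefix); }

  public void deny(String prefix) { denyList.add(prefix); }

  /**
   * A URL belongs to the subcollection if it matches an allow prefix
   * and no deny prefix.
   */
  public boolean contains(String url) {
    for (String d : denyList) {
      if (url.startsWith(d)) return false;
    }
    for (String a : allowList) {
      if (url.startsWith(a)) return true;
    }
    return false;
  }
}
```

The point of the sketch is that only identifier names change; the include/exclude semantics stay exactly as before.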

*Why*
I think we could and should use more neutral terminology and lead by
example.
I want to STRESS that this proposal is by no means an effort by me to
reflect negatively on the authors or their EXCELLENT contributions to
Nutch. I hope this is taken in good faith and we as a community can come
together on this one.

*How*
Please voice your opinions here and we can take it from there. I would
personally love to hear all opinions and I will personally take any
action(s) if we decide to go forward with the proposal.

Thank you for your consideration folks.

Lewis

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Interest in static source code analysis with sonarcloud.io

2019-12-12 Thread lewis john mcgibbney
Hi dev@,
I posted on this topic previously but cannot find the thread.
Well, it turns out that we have made a bit of progress.
See https://issues.apache.org/jira/browse/INFRA-19474 for context.
Is anyone else registered on sonarcloud.io? If so, can you please update
INFRA-19474 as follows:

https://s.apache.org/oecsn

Thank you
Lewis



[jira] [Assigned] (NUTCH-1863) Add JSON format dump output to readdb command

2019-12-03 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1863:
---------------------------------------------

Assignee: Shashanka Balakuntala Srinivasa

> Add JSON format dump output to readdb command
> ---------------------------------------------
>
> Key: NUTCH-1863
> URL: https://issues.apache.org/jira/browse/NUTCH-1863
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb
>Affects Versions: 2.3, 1.10
>    Reporter: Lewis John McGibbney
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Major
>
> Opening up the ability for third parties to consume Nutch crawldb data as 
> JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of 1.X to enable JSON 
> dumps of crawldb data.





[jira] [Commented] (NUTCH-1863) Add JSON format dump output to readdb command

2019-12-03 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987208#comment-16987208
 ] 

Lewis John McGibbney commented on NUTCH-1863:
---------------------------------------------

+1, please go ahead [~balaShashanka]

> Add JSON format dump output to readdb command
> ---------------------------------------------
>
> Key: NUTCH-1863
> URL: https://issues.apache.org/jira/browse/NUTCH-1863
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb
>Affects Versions: 2.3, 1.10
>    Reporter: Lewis John McGibbney
>Priority: Major
>
> Opening up the ability for third parties to consume Nutch crawldb data as 
> JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of both 1.X and 2.X to 
> enable JSON dumps of crawldb data.
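For illustration, a sketch of how a single crawldb entry could be rendered as JSON. The field names (url, status, fetchTime, score) are assumptions for the example, not the actual CrawlDatum schema, and a real implementation would use a JSON library rather than string formatting:

```java
// Hypothetical sketch: render one crawldb entry as a JSON object.
// Field names here are illustrative, not the real CrawlDatum fields.
public class CrawlDbJsonSketch {

  public static String toJson(String url, int status, long fetchTime,
      float score) {
    // Note: a production version should escape the URL and use a JSON
    // library; this is only a shape-of-the-output sketch.
    return String.format(
        "{\"url\":\"%s\",\"status\":%d,\"fetchTime\":%d,\"score\":%s}",
        url, status, fetchTime, score);
  }
}
```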





[jira] [Updated] (NUTCH-1863) Add JSON format dump output to readdb command

2019-12-03 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1863:

Description: 
Opening up the ability for third parties to consume Nutch crawldb data as JSON 
would be a positive thing IMHO.
This issue should improve the readdb functionality of 1.X to enable JSON 
dumps of crawldb data.

  was:
Opening up the ability for third parties to consume Nutch crawldb data as JSON 
would be a positive thing IMHO.
This issue should improve the readdb functionality of both 1.X and 2.X to 
enable JSON dumps of crawldb data.


> Add JSON format dump output to readdb command
> ---------------------------------------------
>
> Key: NUTCH-1863
> URL: https://issues.apache.org/jira/browse/NUTCH-1863
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb
>Affects Versions: 2.3, 1.10
>    Reporter: Lewis John McGibbney
>Priority: Major
>
> Opening up the ability for third parties to consume Nutch crawldb data as 
> JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of 1.X to enable JSON 
> dumps of crawldb data.





Static source code analysis via sonarcloud.io

2019-11-08 Thread lewis john mcgibbney
Hi dev@,
Quick heads up, I am working on sonarcloud.io analysis for the Nutch master
branch.
The reasoning is that I did this previously, whilst we hosted SonarQube
internally at Apache, but didn't really do anything with it. This is a
renewed attempt to study the improvements which can be made to the Nutch
source code.
I'll update once I have news.
Best
Lewis



[jira] [Comment Edited] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin

2019-10-30 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963221#comment-16963221
 ] 

Lewis John McGibbney edited comment on NUTCH-2677 at 10/30/19 4:58 PM:
---------------------------------------------

[~balaShashanka]

bq. can i work on this issue? 

yes

bq. Or is anybody working on this already?

No, however please see NUTCH-2739 which supersedes this issue. 

Hopefully the description provides enough information. 


was (Author: lewismc):
[~balaShashanka]

bq. can i work on this issue? 

yes

bq. Or is anybody working on this already?

No

Hopefully the description provides enough information. 

> Update Jest client in indexer-elastic-rest plugin
> -------------------------------------------------
>
> Key: NUTCH-2677
> URL: https://issues.apache.org/jira/browse/NUTCH-2677
> Project: Nutch
>  Issue Type: Task
>  Components: indexer, plugin
>Affects Versions: 1.15
>    Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
>
> We should really upgrade the dependency to a more recent version
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml
> We are using 2.0.1, the most recent is 6.3.1
> https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar





[jira] [Commented] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin

2019-10-30 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963221#comment-16963221
 ] 

Lewis John McGibbney commented on NUTCH-2677:
---------------------------------------------

[~balaShashanka]

bq. can i work on this issue? 

yes

bq. Or is anybody working on this already?

No

Hopefully the description provides enough information. 

> Update Jest client in indexer-elastic-rest plugin
> -------------------------------------------------
>
> Key: NUTCH-2677
> URL: https://issues.apache.org/jira/browse/NUTCH-2677
> Project: Nutch
>  Issue Type: Task
>  Components: indexer, plugin
>Affects Versions: 1.15
>    Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
>
> We should really upgrade the dependency to a more recent version
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml
> We are using 2.0.1, the most recent is 6.3.1
> https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar





[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

2019-10-14 Thread lewis john mcgibbney
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)

Disclosure date: 2018-10-22

Credit: Pierre Ernst, Salesforce

Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling a web site
containing malicious content

Description: The reporter found an RCE security vulnerability in Nutch
2.3.1 when crawling a web site that links to a doctored Matlab file. This was
due to unsafe deserialization of user-generated content. The root cause is
two outdated third-party dependencies:
1. Apache Tika version 1.10 (CVE-2016-6809)
2. Apache Commons Collections 4 version 4.0 (COLLECTIONS-580)
Upgrading these two dependencies to the latest versions fixes the issue.

Resolution: The Apache Nutch Project Management Committee released Apache
Nutch 2.4 on 2019-10-11 (https://s.apache.org/uw8i3). All users of the 2.X
branch should upgrade to this version immediately. In addition, note that
v2.4 is expected to be the last release of the 2.x series. The Nutch PMC
decided to freeze the development on the 2.x branch for now, as no
committers are actively working on it. See the above hyperlink for more
information on upgrading and the 2.x retirement decision.

Contact: either dev[at] or private[at]nutch[dot]apache[dot]org depending on
the nature of your contact.

Regards lewismc
(On behalf of the Apache Nutch PMC)


[jira] [Work stopped] (NUTCH-2307) Implement Missing NutchServer REST API Tests

2019-10-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2307 stopped by Lewis John McGibbney.
---------------------------------------------------
> Implement Missing NutchServer REST API Tests
> --------------------------------------------
>
> Key: NUTCH-2307
> URL: https://issues.apache.org/jira/browse/NUTCH-2307
> Project: Nutch
>  Issue Type: Improvement
>  Components: REST_api, web gui
>Affects Versions: 2.3.1
>Reporter: Furkan Kamaci
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.5
>
>
> TestAPI.java was entirely commented out. The reason was indicated as:
> {quote}
> CURRENTLY DISABLED. TESTS ARE FLAPPING FOR NO APPARENT REASON.
> SHALL BE FIXED OR REPLACES BY NEW API IMPLEMENTATION
> {quote}
> So, we should implement those missing tests based on the new 
> AbstractNutchAPITestBase.





[jira] [Assigned] (NUTCH-2307) Implement Missing NutchServer REST API Tests

2019-10-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2307:
---------------------------------------------

Assignee: Lewis John McGibbney

> Implement Missing NutchServer REST API Tests
> --------------------------------------------
>
> Key: NUTCH-2307
> URL: https://issues.apache.org/jira/browse/NUTCH-2307
> Project: Nutch
>  Issue Type: Improvement
>  Components: REST_api, web gui
>Affects Versions: 2.3.1
>Reporter: Furkan Kamaci
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.5
>
>
> TestAPI.java was entirely commented out. The reason was indicated as:
> {quote}
> CURRENTLY DISABLED. TESTS ARE FLAPPING FOR NO APPARENT REASON.
> SHALL BE FIXED OR REPLACES BY NEW API IMPLEMENTATION
> {quote}
> So, we should implement those missing tests based on the new 
> AbstractNutchAPITestBase.





[jira] [Work stopped] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2019-10-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1709 stopped by Lewis John McGibbney.
---------------------------------------------------
> Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain 
> methods not defined in source .avsc
> -----------------------------------------------------------------------------
>
> Key: NUTCH-1709
> URL: https://issues.apache.org/jira/browse/NUTCH-1709
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.5
>
> Attachments: NUTCH-1709.patch
>
>
> When using the GoraCompiler currently packaged with gora-core-0.4-SNAPSHOT, 
> the following methods are removed from o.a.n.storage.Host or 
> o.a.n.storage.ProtocolStatus
> {code:title=Host.java|borderStyle=solid}
>   public boolean contains(String key) {
> return metadata.containsKey(new Utf8(key));
>   }
>   
>   public String getValue(String key, String defaultValue) {
> if (!contains(key)) return defaultValue;
> return Bytes.toString(metadata.get(new Utf8(key)));
>   }
>   
>   public int getInt(String key, int defaultValue) {
> if (!contains(key)) return defaultValue;
> return Integer.parseInt(getValue(key,null));
>   }
>   public long getLong(String key, long defaultValue) {
> if (!contains(key)) return defaultValue;
> return Long.parseLong(getValue(key,null));
>   }
> {code}
> {code:title=ProtocolStatus.java|borderStyle=solid}
>   /**
>* A convenience method which returns a successful {@link ProtocolStatus}.
>* @return the {@link ProtocolStatus} value for 200 (success).
>*/
>   public boolean isSuccess() {
> return code == ProtocolStatusUtils.SUCCESS; 
>   }
> {code}
> This results in compilation errors... I am not sure if it is good practice 
> for non-default methods to be contained within generated Persistent classes. 
> This is certainly the case with newer versions of Avro when using the Java 
> API.
> compile-core:
> [javac] Compiling 104 source files to 
> /home/mary/Downloads/apache/2.x/build/classes
> [javac] warning: [options] bootstrap class path not set in conjunction 
> with -source 1.6
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:345:
>  error: cannot find symbol
> [javac]host.getInt("q_mt", 
> maxThreads),
> [javac]^
> [javac]   symbol:   method getInt(String,int)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:346:
>  error: cannot find symbol
> [javac]host.getLong("q_cd", 
> crawlDelay),
> [javac]^
> [javac]   symbol:   method getLong(String,long)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:347:
>  error: cannot find symbol
> [javac]host.getLong("q_mcd", 
> minCrawlDelay));
> [javac]^
> [javac]   symbol:   method getLong(String,long)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/parse/ParserChecker.java:114:
>  error: cannot find symbol
> [javac] if(!protocolOutput.getStatus().isSuccess()) {
> [javac]   ^
> [javac]   symbol:   method isSuccess()
> [javac]   location: class ProtocolStatus
> [javac] Note: 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/storage/Host.java 
> uses unchecked or unsafe operations.
> [javac] Note: Recompile with -Xlint:unchecked for details.
> [javac] 4 errors
> [javac] 1 warning
> I think it would be a good idea to find another home for such methods as it 
> will undoubtedly avoid problems when we do Gora upgrades in the future.
> Right now I don't have a suggestion but will work on a solution nonetheless.
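One possible "other home" for these helpers, sketched under the assumption that a plain static utility over the metadata map would suffice. The class name is hypothetical, and a Map<String,String> stands in for the real Utf8/ByteBuffer-keyed map used by the generated classes:

```java
import java.util.Map;

// Sketch: move the convenience methods out of the generated Persistent
// classes into a static utility that operates on the metadata map, so
// regenerating the classes with a newer GoraCompiler cannot drop them.
public class HostMetadataUtil {

  public static boolean contains(Map<String, String> metadata, String key) {
    return metadata.containsKey(key);
  }

  public static String getValue(Map<String, String> metadata, String key,
      String defaultValue) {
    return metadata.getOrDefault(key, defaultValue);
  }

  public static int getInt(Map<String, String> metadata, String key,
      int defaultValue) {
    if (!contains(metadata, key)) return defaultValue;
    return Integer.parseInt(metadata.get(key));
  }

  public static long getLong(Map<String, String> metadata, String key,
      long defaultValue) {
    if (!contains(metadata, key)) return defaultValue;
    return Long.parseLong(metadata.get(key));
  }
}
```

Call sites such as FetcherReducer would then use `HostMetadataUtil.getInt(host.getMetadata(), "q_mt", maxThreads)` instead of a method on the generated Host class.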





[jira] [Updated] (NUTCH-2722) Fetch dependencies via https

2019-10-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2722:

Fix Version/s: (was: 2.5)

> Fetch dependencies via https
> ----------------------------
>
> Key: NUTCH-2722
> URL: https://issues.apache.org/jira/browse/NUTCH-2722
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 2.5, 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Dependencies need to be fetched via https, see 
> https://central.sonatype.org/articles/2019/Apr/30/http-access-to-repo1mavenorg-and-repomavenapacheorg-is-being-deprecated/





Re: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-03 Thread Lewis John McGibbney
Hi Seb,
Sigs check out fine

gpg --verify apache-nutch-1.16-src.tar.gz.asc apache-nutch-1.16-src.tar.gz

gpg: Signature made Wed Oct  2 08:07:47 2019 PDT
gpg: using RSA key FF82A487F92D70E52FF77E0AC66EA7B7DB0A9C6D
gpg: Good signature from "Sebastian Nagel " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: FF82 A487 F92D 70E5 2FF7  7E0A C66E A7B7 DB0A 9C6D

sha512sum --check apache-nutch-1.16-src.tar.gz.sha512
apache-nutch-1.16-src.tar.gz: OK

Tests pass successfully

All top level files are fine in terms of dates and licenses.

[X] +1 Release this package as Apache Nutch 1.16.

On 2019/10/02 17:54:59, Sebastian Nagel  wrote: 
> Hi Folks,
> 
> A first candidate for the Nutch 1.16 release is available at:
> 
>https://dist.apache.org/repos/dist/dev/nutch/1.16/
> 
> The release candidate is a zip and tar.gz archive of the binary and sources 
> in:
>https://github.com/apache/nutch/tree/release-1.16
> 
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1017/
> 
> We addressed 104 Issues:
>https://s.apache.org/l2j94
> 
> Please vote on releasing this package as Apache Nutch 1.16.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Nutch 1.16.
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
> 
> P.S. Here is my +1.
> 


[jira] [Comment Edited] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-03-13 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792267#comment-16792267
 ] 

Lewis John McGibbney edited comment on NUTCH-2669 at 3/14/19 2:32 AM:
---------------------------------------------

[~wastl-nagel] this has become a blocker issue whilst attempting to roll the 
2.4 release candidate. 

I've tried using multiple combinations of proposed fixes but I cannot get Nutch 
branch-2.4 to build from source any more. 

{code}
ant clean test
...

...

...

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/Users/lmcgibbn/Downloads/nutch/ivy/ivysettings.xml
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve]   module not found: 
javax.measure#unit-api;working@LMC-056430
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/lmcgibbn/.ivy2/local/javax.measure/unit-api/working@LMC-056430/ivys/ivy.xml
[ivy:resolve] -- artifact 
javax.measure#unit-api;working@LMC-056430!unit-api.jar:
[ivy:resolve] 
/Users/lmcgibbn/.ivy2/local/javax.measure/unit-api/working@LMC-056430/jars/unit-api.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom
[ivy:resolve] -- artifact 
javax.measure#unit-api;working@LMC-056430!unit-api.jar:
[ivy:resolve] 
http://repo1.maven.org/maven2/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar
[ivy:resolve]    sonatype: tried
[ivy:resolve] 
http://oss.sonatype.org/content/repositories/releases/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom
[ivy:resolve] -- artifact 
javax.measure#unit-api;working@LMC-056430!unit-api.jar:
[ivy:resolve] 
http://oss.sonatype.org/content/repositories/releases/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve] 
https://repository.apache.org/content/repositories/snapshots/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom
[ivy:resolve] -- artifact 
javax.measure#unit-api;working@LMC-056430!unit-api.jar:
[ivy:resolve] 
https://repository.apache.org/content/repositories/snapshots/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar
[ivy:resolve]    restlet: tried
[ivy:resolve] 
http://maven.restlet.org/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom
[ivy:resolve] -- artifact 
javax.measure#unit-api;working@LMC-056430!unit-api.jar:
[ivy:resolve] 
http://maven.restlet.org/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar
[ivy:resolve]  ERRORS
[ivy:resolve]   impossible to get artifacts when data has not been loaded. 
IvyNode = javax.measure#unit-api;1.0
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
{code}




was (Author: lewismc):
[~wastl-nagel] this has become a major pain whilst attempting to roll the 2.4 
release candidate. The release is essentially blocked until this issue is 
resolved. 

> Reliable solution for javax.ws packaging.type
> ---------------------------------------------
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Priority: Blocker
> Fix For: 2.4
>
>
> The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an 
> ant/ivy issue during build when resolving/fetching dependencies:
> {noformat}
> [ivy:resolve] [FAILED ] 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve]   
> /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve]   
> http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve]   
> https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve]   
> http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] ::  FAILED DOWNLOADS::
> [ivy:resolve] :: ^ see resolution messages for details  ^ ::

[jira] [Updated] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-03-13 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2669:

Priority: Blocker  (was: Major)

> Reliable solution for javax.ws packaging.type
> ---------------------------------------------
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Priority: Blocker
> Fix For: 2.5
>
>
> The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an 
> ant/ivy issue during build when resolving/fetching dependencies:
> {noformat}
> [ivy:resolve] [FAILED ] 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve]   
> /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve]   
> http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve]   
> https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve]   
> http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] ::  FAILED DOWNLOADS::
> [ivy:resolve] :: ^ see resolution messages for details  ^ ::
> [ivy:resolve] ::
> [ivy:resolve] :: 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve]  ERRORS
> ...
> BUILD FAILED
> {noformat}
> More information about this issue is linked on 
> [jax-rs#576|https://github.com/jax-rs/api/pull/576]. 
> A work-around is to define a property {{packaging.type}} and set it to 
> {{jar}}. This can be done
> - in command-line {{ant -Dpackaging.type=jar ...}}
> - in default.properties
> - in ivysettings.xml
> The last work-around is active in current master/1.x. However, there are 
> still Jenkins builds failing while few succeed:
> ||#build||status jax-rs||machine||work-around||
> |3578|success|H28|ivysettings.xml|
> |3577|failed|H28|ivysettings.xml|
> |3576|failed|H33|ivysettings.xml|
> |3575|success|ubuntu-4|ivysettings.xml|
> |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties|
> |3571|failed|?|-Dpackaging.type=jar + default.properties|
> |3568|failed|?|-Dpackaging.type=jar + default.properties|
> Builds which failed for other reasons are left out. The only pattern I see 
> is that only the second build on each of the Jenkins machines succeeds. A 
> possible reason could be that the build environments on the machines persist 
> state (the Nutch build directory, local ivy cache, etc.). If this is the 
> case, it may take some time until all Jenkins machines will succeed.
> The ivysettings.xml work-around was the first which succeeded on a Jenkins 
> build but it may be the case that all three work-arounds apply.
> The issue is supposed to be resolved (without work-arounds) by IVY-1577. 
> However, it looks like it isn't:
> - get rc2 of ivy 2.5.0 (the URL may change):
> {noformat}
> % wget -O ivy/ivy-2.5.0-rc2-test.jar \
> 
> https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar
> {noformat}
> - edit default properties and set {{ivy.version=2.5.0-rc2-test}}
> - remove work-around in ivysettings.xml (or default.properties)
> - run {{ant clean runtime}} and check for failure resp. whether javax.ws lib 
> is in place: {{ls build/lib/javax.ws.rs-api*.jar}}
> This solution fails for 
> [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar]
>  and the mentioned rc2 jar as of 2018-10-23. But maybe the procedure is 
> wrong, I'll contact the ant/ivy team to solve this.
>  
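For reference, the ivysettings.xml work-around described above would amount to a property definition along these lines (a sketch using Ivy's `<property>` element; the exact placement within the file may differ):

```xml
<!-- ivysettings.xml: define packaging.type so that the
     javax.ws.rs-api artifact pattern resolves to a plain jar -->
<property name="packaging.type" value="jar" override="false"/>
```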



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-03-13 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2669:

Fix Version/s: (was: 2.5)
   2.4

> Reliable solution for javax.ws packaging.type
> ---------------------------------------------
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Priority: Blocker
> Fix For: 2.4
>
>
> The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an 
> ant/ivy issue during build when resolving/fetching dependencies:
> {noformat}
> [ivy:resolve] [FAILED ] 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve]   
> /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve]   
> http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve]   
> https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve]   
> http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] ::  FAILED DOWNLOADS::
> [ivy:resolve] :: ^ see resolution messages for details  ^ ::
> [ivy:resolve] ::
> [ivy:resolve] :: 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve]  ERRORS
> ...
> BUILD FAILED
> {noformat}
> More information about this issue is linked on 
> [jax-rs#576|https://github.com/jax-rs/api/pull/576]. 
> A work-around is to define a property {{packaging.type}} and set it to 
> {{jar}}. This can be done
> - in command-line {{ant -Dpackaging.type=jar ...}}
> - in default.properties
> - in ivysettings.xml
> The last work-around is active in current master/1.x. However, there are 
> still Jenkins builds failing while few succeed:
> ||#build||status jax-rs||machine||work-around||
> |3578|success|H28|ivysettings.xml|
> |3577|failed|H28|ivysettings.xml|
> |3576|failed|H33|ivysettings.xml|
> |3575|success|ubuntu-4|ivysettings.xml|
> |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties|
> |3571|failed|?|-Dpackaging.type=jar + default.properties|
> |3568|failed|?|-Dpackaging.type=jar + default.properties|
> Builds which failed for other reasons are left out. The only pattern I see 
> is that only the second build on each of the Jenkins machines succeeds. A 
> possible reason could be that the build environments on the machines persist 
> state (the Nutch build directory, local ivy cache, etc.). If this is the 
> case, it may take some time until all Jenkins machines succeed.
> The ivysettings.xml work-around was the first which succeeded on a Jenkins 
> build but it may be the case that all three work-arounds apply.
> The issue is supposed to be resolved (without work-arounds) by IVY-1577. 
> However, it looks like it isn't:
> - get rc2 of ivy 2.5.0 (the URL may change):
> {noformat}
> % wget -O ivy/ivy-2.5.0-rc2-test.jar \
> 
> https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar
> {noformat}
> - edit default properties and set {{ivy.version=2.5.0-rc2-test}}
> - remove work-around in ivysettings.xml (or default.properties)
> - run {{ant clean runtime}} and check whether the build fails, or whether the 
> javax.ws library is in place: {{ls build/lib/javax.ws.rs-api*.jar}}
> This solution fails for 
> [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar]
>  and the mentioned rc2 jar as of 2018-10-23. But maybe the procedure is 
> wrong; I'll contact the ant/ivy team to resolve this.
>  
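The ivysettings.xml variant of the work-around described above amounts to declaring the property in the settings file so it is defined before module descriptors are resolved. A minimal sketch, with assumed surrounding file content (only the `property` line reflects the work-around itself):

```xml
<ivysettings>
  <!-- Work-around for the unresolved ${packaging.type} in the
       javax.ws.rs-api POM: force it to "jar". override="false" lets a
       -Dpackaging.type=... set on the command line still take precedence. -->
  <property name="packaging.type" value="jar" override="false"/>
  <!-- ... existing settings and resolvers ... -->
</ivysettings>
```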



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-03-13 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792267#comment-16792267
 ] 

Lewis John McGibbney commented on NUTCH-2669:
-

[~wastl-nagel] this has become a major pain whilst attempting to roll the 2.4 
release candidate. The release is essentially blocked until this issue is 
resolved. 

> Reliable solution for javax.ws packaging.type
> -
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 2.5
>
>





[jira] [Commented] (NUTCH-2498) Docker files are outdated

2019-03-09 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788812#comment-16788812
 ] 

Lewis John McGibbney commented on NUTCH-2498:
-

[~dhirajforyou] thank you for reporting this. I am just about to push the Nutch 
2.4 release, which is the last release of the 2.x line. Do you want to provide 
a pull request? If not, then please just resolve this as won't fix.

> Docker files are outdated
> -
>
> Key: NUTCH-2498
> URL: https://issues.apache.org/jira/browse/NUTCH-2498
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 2.4
>Reporter: dhirajforyou
>Priority: Blocker
>  Labels: build
> Fix For: 2.4
>
>
> The Docker file for HBase is outdated: it uses Java 7, but Nutch requires Java 8.
> The Cassandra Docker file refers to meabed/debian-jdk, which is also based on 
> Java 7.





Mavenize Nutch Build as Google Summer of Code

2019-03-09 Thread lewis john mcgibbney
Hi user@ and dev@,
If you are a student and would like to tackle the task of Mavenizing the
Nutch master build please get in touch with me here, directly or comment on
the following issue
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-2292
Thank you
Lewis
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Resolved] (NUTCH-2698) Remove sonar build task from build.xml

2019-03-05 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2698.
-
Resolution: Fixed

Thanks [~wastl-nagel] for review. 

> Remove sonar build task from build.xml
> --
>
> Key: NUTCH-2698
> URL: https://issues.apache.org/jira/browse/NUTCH-2698
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.15
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.16
>
>
> build.xml currently has the following content
> {code}
> <!-- XML content stripped by the mail archive; the surviving fragment shows
>      a sonar taskdef declared with version="1.4-SNAPSHOT"
>      xmlns:sonar="antlib:org.sonar.ant" -->
> {code}
> We should simply remove as it is defunct.





[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2019-03-05 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2292:

Labels: gsoc2019  (was: )

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>  Labels: gsoc2019
> Fix For: 1.16
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>   |-- pom.xml (POM)
>   |-- plugin1
>   ||-- pom.xml (JAR)
>   | .
>   |-- pluginN
>|-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins; 
> introduce another POM to break the cycle if required.
>  
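The layout above, together with the note about breaking cycles, could translate into a parent POM roughly like the following sketch (coordinates and version are illustrative, not taken from an actual Nutch pom.xml):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.nutch</groupId>
  <artifactId>nutch-parent</artifactId>
  <version>1.16-SNAPSHOT</version>
  <!-- packaging "pom" marks this as an aggregator, not a JAR build -->
  <packaging>pom</packaging>
  <modules>
    <module>nutch-core</module>
    <module>nutch-plugins</module>
  </modules>
</project>
```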





[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2019-03-02 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782482#comment-16782482
 ] 

Lewis John McGibbney commented on NUTCH-2292:
-

Hi [~wastl-nagel] long story short... we need to rebase this against master and 
then talk through the issue. As it was a while ago, I've all but forgotten what 
decisions we made at the time and what the implications are/were.

How about we propose this as a GSoC project?

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
> Fix For: 1.16
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>   |-- pom.xml (POM)
>   |-- plugin1
>   ||-- pom.xml (JAR)
>   | .
>   |-- pluginN
>|-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins; 
> introduce another POM to break the cycle if required.
>  





[jira] [Created] (NUTCH-2698) Remove sonar build task from build.xml

2019-03-02 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2698:
---

 Summary: Remove sonar build task from build.xml
 Key: NUTCH-2698
 URL: https://issues.apache.org/jira/browse/NUTCH-2698
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.15
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.16


build.xml currently has the following content
{code}
<!-- XML content stripped by the mail archive; the surviving fragment shows
     a sonar taskdef declared with version="1.4-SNAPSHOT"
     xmlns:sonar="antlib:org.sonar.ant" -->
{code}

We should simply remove as it is defunct.





[jira] [Commented] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.

2019-03-02 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782477#comment-16782477
 ] 

Lewis John McGibbney commented on NUTCH-2697:
-

Apologies folks, thank you [~wastl-nagel] for reverting. This has prompted me 
to look at another part of build.xml for a fix so I will go ahead and submit 
that.

> Upgrade Ivy to fix the issue of an unset packaging.type property.
> -
>
> Key: NUTCH-2697
> URL: https://issues.apache.org/jira/browse/NUTCH-2697
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.16
>Reporter: Chris Gavin
>Priority: Major
> Fix For: 1.16
>
>
> Currently Nutch fails to build from a clean checkout due to 
> {{packaging.type}} not being set (even with the current workaround in 
> {{ivysettings.xml}}).
> {code:java}
> [ivy:resolve] :: problems summary ::
> [ivy:resolve]  WARNINGS
> [ivy:resolve] [FAILED ] 
> javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}: (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve] 
> /opt/work/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve] 
> http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve] 
> https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve] 
> http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] :: FAILED DOWNLOADS ::
> [ivy:resolve] :: ^ see resolution messages for details ^ ::
> [ivy:resolve] ::
> [ivy:resolve] :: 
> javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] 
> BUILD FAILED{code}
> This issue has been fixed in the latest version of Ivy so upgrading will 
> cause the build to work correctly again.





[jira] [Commented] (NUTCH-2679) "ant eclipse" failed as eclipse binary is moved

2018-12-12 Thread lewis john mcgibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718828#comment-16718828
 ] 

lewis john mcgibbney commented on NUTCH-2679:
-

Can we use
https://search.maven.org/artifact/ant4eclipse/ant4eclipse/0.5.0.rc1/jar ?




> "ant eclipse" failed as eclipse binary is moved
> ---
>
> Key: NUTCH-2679
> URL: https://issues.apache.org/jira/browse/NUTCH-2679
> Project: Nutch
>  Issue Type: Test
>  Components: build
>Affects Versions: 1.15
>Reporter: dhirajforyou
>Priority: Major
>
>  
> curl -I 
> "https://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
> HTTP/1.1 302 Found
> Server: nginx/1.13.12
> Date: Wed, 12 Dec 2018 10:32:28 GMT
> Content-Type: text/html; charset=UTF-8
> Connection: keep-alive
> Content-Disposition: attachment; filename="ant-eclipse-1.0.bin.tar.bz2"
> Set-Cookie: 
> sf_mirror_attempt="ant-eclipse:liquidtelecom:ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2";
>  Max-Age=120; Path=/; expires=Wed, 12-Dec-2018 10:34:28 GMT
> Location: 
> [https://liquidtelecom.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2]
>  
> so the eclipse binary source URL needs to be changed.
>  
> @ [~wastl-nagel] @ [~lewismc]
> Last time we changed http to https, and this time the URL itself changed.
> Can you suggest the best way to overcome this?





[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2018-11-30 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704793#comment-16704793
 ] 

Lewis John McGibbney commented on NUTCH-2292:
-

I'm going to take this work on from where [~thammegowda] got to. I'll see if I 
can forward merge master into NUTCH-2292 branch first.

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
> Fix For: 1.16
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>   |-- pom.xml (POM)
>   |-- plugin1
>   ||-- pom.xml (JAR)
>   | .
>   |-- pluginN
>|-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins; 
> introduce another POM to break the cycle if required.
>  





Maven vs Gradle for Nutch Build System

2018-11-29 Thread lewis john mcgibbney
Hi Folks,
Seb and I were talking build systems this week. I wanted to get a feel for
what we as a PMC would rather use for the next Nutch build lifecycle.
Personally I've used Maven for many of my Java projects, however I have also
really enjoyed working with Gradle.
I would like to start working on the build system so that we can
streamline the Nutch release process. Also, we've seen people request Nutch
plugins as Maven artifacts for some time.
Any thoughts?
Lewis



[jira] [Created] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin

2018-11-28 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2677:
---

 Summary: Update Jest client in indexer-elastic-rest plugin
 Key: NUTCH-2677
 URL: https://issues.apache.org/jira/browse/NUTCH-2677
 Project: Nutch
  Issue Type: Task
  Components: indexer, plugin
Affects Versions: 1.15
Reporter: Lewis John McGibbney
 Fix For: 1.16


We should really upgrade the dependency to a more recent version:
https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml
We are using 2.0.1; the most recent is 6.3.1:
https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar
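In the plugin's ivy.xml, the upgrade would presumably amount to a one-line revision bump along these lines (the conf mapping is illustrative, not copied from the actual file):

```xml
<!-- bump the Jest client from 2.0.1 to the current release -->
<dependency org="io.searchbox" name="jest" rev="6.3.1" conf="*->default"/>
```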





[jira] [Updated] (NUTCH-2667) Update Tika and Commons Collections 4

2018-10-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2667:

Description: Tika and Commons Collections 4 need to be updated. This issue 
needs to address them.  (was: Tika and Commons Collections 4 need to be updated 
due to known CVE's.
This issue needs to address them.)

> Update Tika and Commons Collections 4
> -
>
> Key: NUTCH-2667
> URL: https://issues.apache.org/jira/browse/NUTCH-2667
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.4
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.4
>
>
> Tika and Commons Collections 4 need to be updated. This issue needs to 
> address them.





[jira] [Created] (NUTCH-2667) Update Tika and Commons Collections 4

2018-10-23 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2667:
---

 Summary: Update Tika and Commons Collections 4
 Key: NUTCH-2667
 URL: https://issues.apache.org/jira/browse/NUTCH-2667
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.4


Tika and Commons Collections 4 need to be updated due to known CVE's.
This issue needs to address them.





[jira] [Resolved] (NUTCH-2199) Documentation for Nutch 2.X REST API

2018-10-18 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2199.
-
Resolution: Fixed

> Documentation for Nutch 2.X REST API
> 
>
> Key: NUTCH-2199
> URL: https://issues.apache.org/jira/browse/NUTCH-2199
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, REST_api
>Affects Versions: 2.3.1
>    Reporter: Lewis John McGibbney
>Assignee: Furkan KAMACI
>Priority: Minor
> Fix For: 2.5
>
>
> The work done on NUTCH-1800 needs to be ported to 2.X branch. This is 
> trivial, I thought I had already done it but obviously not. 





[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol

2018-08-27 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594347#comment-16594347
 ] 

Lewis John McGibbney commented on NUTCH-1861:
-

[~yossi] the existing JavaMail license is incompatible with ALv2.0 meaning that 
I am going to look at using the following instead
{code}
<dependency>
  <groupId>org.apache.geronimo.javamail</groupId>
  <artifactId>geronimo-javamail_1.4</artifactId>
  <version>1.9.0-alpha-2</version>
  <type>pom</type>
</dependency>
{code}
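Since the Nutch build resolves dependencies through Ivy rather than Maven, the same coordinates would land in an ivy.xml roughly as follows (a sketch with an assumed conf mapping, not the actual change):

```xml
<!-- Ivy equivalent of the Maven coordinates quoted above -->
<dependency org="org.apache.geronimo.javamail" name="geronimo-javamail_1.4"
            rev="1.9.0-alpha-2" conf="*->default"/>
```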

> Implement POP3 Protocol
> ---
>
> Key: NUTCH-1861
> URL: https://issues.apache.org/jira/browse/NUTCH-1861
> Project: Nutch
>  Issue Type: Task
>  Components: protocol
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>
> Implementing the Post Office Protocol within Nutch would open up a new use 
> case which is crawling and indexing of some mail servers.
> This is particularly useful for investigation purposes or for porting/mapping 
> mail from one server to another. 
> We *may* be able to kill two birds with one stone by implementing both IMAP 
> and POP3 protocols under the one plugin.
> http://commons.apache.org/proper/commons-net/





[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol

2018-08-27 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594141#comment-16594141
 ] 

Lewis John McGibbney commented on NUTCH-1861:
-

Hi [~yossi] thanks for response

bq. Isn't SMTP only for sending (and relaying) messages? How can it be used for 
crawling?

When I opened this ticket my understanding of SMTP was much less (it is still 
not great however it is slightly better). Your description sounds correct.

bq. I assume crawling in this instance will be in the context of a specific 
user (with password), but this user may have access to multiple 
mailboxes/folders (at least with IMAP, I don't think POP3 supports such 
features). Do you intend to support multiple users/passwords?

Yes. These URIs would be injected as normal or read from configuration.

bq. Why did you choose Commons Net over JavaMail?

I was not sure which implementation was more suited to what we were looking to 
achieve. If JavaMail is the way to go, then I will code it up using that 
underlying library instead. 
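One way such injected seeds could carry per-mailbox credentials is the standard userinfo component of a URI. A hypothetical sketch, not part of Nutch (the class and field names are illustrative):

```java
import java.net.URI;

// Hypothetical sketch: per-mailbox credentials carried in injected seed URIs
// of the form pop3://user:password@host/folder, as discussed above.
class MailSeed {
    final String user, password, host, folder;

    MailSeed(String seedUrl) {
        URI uri = URI.create(seedUrl);
        // userinfo is "user" or "user:password"; limit 2 keeps ':' in passwords
        String[] auth = uri.getUserInfo().split(":", 2);
        this.user = auth[0];
        this.password = auth.length > 1 ? auth[1] : "";
        this.host = uri.getHost();
        // drop the leading '/' to get the mailbox/folder name
        this.folder = uri.getPath().replaceFirst("^/", "");
    }

    public static void main(String[] args) {
        MailSeed seed = new MailSeed("pop3://alice:s3cret@mail.example.org/INBOX");
        System.out.println(seed.user + " @ " + seed.host + " / " + seed.folder);
        // prints: alice @ mail.example.org / INBOX
    }
}
```

A protocol plugin could then hand these fields to whichever mail library is chosen (commons-net, JavaMail, or Geronimo's implementation) when opening the connection.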

> Implement POP3 Protocol
> ---
>
> Key: NUTCH-1861
> URL: https://issues.apache.org/jira/browse/NUTCH-1861
> Project: Nutch
>  Issue Type: Task
>  Components: protocol
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>
> Implementing the Post Office Protocol within Nutch would open up a new use 
> case which is crawling and indexing of some mail servers.
> This is particularly useful for investigation purposes or for porting/mapping 
> mail from one server to another. 
> We *may* be able to kill two birds with one stone by implementing both IMAP 
> and POP3 protocols under the one plugin.
> http://commons.apache.org/proper/commons-net/





[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol

2018-08-27 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593875#comment-16593875
 ] 

Lewis John McGibbney commented on NUTCH-1861:
-

Hi Folks, using commons-net I was thinking of bulking support for SMTP(S), 
POP3(S) and IMAP(S) into the same _*protocol-email*_ plugin. Is this a better 
architecture choice than separating each implementation out into an individual 
protocol plugin?

> Implement POP3 Protocol
> ---
>
> Key: NUTCH-1861
> URL: https://issues.apache.org/jira/browse/NUTCH-1861
> Project: Nutch
>  Issue Type: Task
>  Components: protocol
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>
> Implementing the Post Office Protocol within Nutch would open up a new use 
> case which is crawling and indexing of some mail servers.
> This is particularly useful for investigation purposes or for porting/mapping 
> mail from one server to another. 
> We *may* be able to kill two birds with one stone by implementing both IMAP 
> and POP3 protocols under the one plugin.
> http://commons.apache.org/proper/commons-net/





[jira] [Assigned] (NUTCH-1861) Implement POP3 Protocol

2018-08-27 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1861:
---

Assignee: Lewis John McGibbney

> Implement POP3 Protocol
> ---
>
> Key: NUTCH-1861
> URL: https://issues.apache.org/jira/browse/NUTCH-1861
> Project: Nutch
>  Issue Type: Task
>  Components: protocol
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>
> Implementing the Post Office Protocol within Nutch would open up a new use 
> case which is crawling and indexing of some mail servers.
> This is particularly useful for investigation purposes or for porting/mapping 
> mail from one server to another. 
> We *may* be able to kill two birds with one stone by implementing both IMAP 
> and POP3 protocols under the one plugin.
> http://commons.apache.org/proper/commons-net/





[jira] [Resolved] (NUTCH-2633) Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13

2018-08-10 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2633.
-
Resolution: Fixed

We can address Ivy issues in a separate patch.

> Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13
> --
>
> Key: NUTCH-2633
> URL: https://issues.apache.org/jira/browse/NUTCH-2633
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.16
> Environment: java version "10.0.2" 2018-07-17
> Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
> Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)
> Nutch master 01c5d6ea17d7b60d25d4e65462b2a654f10680c3 (Thu Jul 26 14:55:38 
> 2018 +0200)
>    Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.16
>
>
> I just got around to making a dev upgrade to >= JDK 10.
> When building master with this environment JDK, I get several compile-time 
> deprecation warnings, which are reflected in the attached build log. 
> Additionally, I get some issues with Ivy... see below
> {code}
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.ivy.util.url.IvyAuthenticator 
> (file:/Users/lmcgibbn/.ant/lib/ivy-2.3.0.jar) to field 
> java.net.Authenticator.theAuthenticator
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.ivy.util.url.IvyAuthenticator
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> [ivy:resolve] :: problems summary ::
> [ivy:resolve]  ERRORS
> [ivy:resolve] unknown resolver null
> [ivy:resolve] unknown resolver null
> [ivy:resolve] unknown resolver null
> {code}





[jira] [Created] (NUTCH-2633) Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13

2018-08-09 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2633:
---

 Summary: Fix deprecation warnings when building Nutch master 
branch under JDK 10.0.2+13
 Key: NUTCH-2633
 URL: https://issues.apache.org/jira/browse/NUTCH-2633
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.16
 Environment: java version "10.0.2" 2018-07-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

Nutch master 01c5d6ea17d7b60d25d4e65462b2a654f10680c3 (Thu Jul 26 14:55:38 2018 
+0200)
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.16


I just got around to making a dev upgrade to >= JDK 10.
When building master with this environment JDK, I get several compile-time 
deprecation warnings, which are reflected in the attached build log.

Additionally, I get some issues with Ivy... see below
{code}
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.ivy.util.url.IvyAuthenticator 
(file:/Users/lmcgibbn/.ant/lib/ivy-2.3.0.jar) to field 
java.net.Authenticator.theAuthenticator
WARNING: Please consider reporting this to the maintainers of 
org.apache.ivy.util.url.IvyAuthenticator
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
[ivy:resolve] :: problems summary ::
[ivy:resolve]  ERRORS
[ivy:resolve]   unknown resolver null
[ivy:resolve]   unknown resolver null
[ivy:resolve]   unknown resolver null
{code}





[jira] [Resolved] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2018-08-01 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-.
-
Resolution: Fixed

Thank you [~alaffet] and everyone else for working on the fix. 

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: 2.4
>
> Attachments: NUTCH-.patch, TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2
> It happens only if the page has not changed
> To reproduce easily, please add to nutch-site.xml :
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 
>   minute)</description>
> </property>
> {code}





[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2018-07-31 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564745#comment-16564745
 ] 

Lewis John McGibbney commented on NUTCH-2222:
-

[~alaffet] thank you, can you please provide a patch? 

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: 2.4
>
> Attachments: TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page
>   (1 minute)</description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-05-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484563#comment-16484563
 ] 

Lewis John McGibbney commented on NUTCH-2512:
-

See my comment above...

> Nutch 1.14 does not work under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as off 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> After commenting out the "offending code", the build finishes, but the 
> resulting binary fails to function (as does the Apache-compiled binary 
> distribution). Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-04-10 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2539:

Fix Version/s: 1.15

> Not correct naming of db.url.filters and db.url.normalizers in 
> nutch-default.xml
> 
>
> Key: NUTCH-2539
> URL: https://issues.apache.org/jira/browse/NUTCH-2539
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Major
> Fix For: 1.15
>
>
> There is a mismatch between config and code.
> In the code, CrawlDbFilter lines 41-43:
> > public static final String URL_FILTERING = "crawldb.url.filters";
> > public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> > public static final String URL_NORMALIZING_SCOPE = 
> > "crawldb.url.normalizers.scope";
>  
> In nutch-default.xml
> > <property>
> >   <name>db.url.normalizers</name>
> >   <value>false</value>
> >   <description>Normalize urls when updating crawldb</description>
> > </property>
> >
> > <property>
> >   <name>db.url.filters</name>
> >   <value>false</value>
> >   <description>Filter urls when updating crawldb</description>
> > </property>
> These properties should be in line with code.
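The effect of the mismatch can be sketched in isolation. A plain map stands in for Hadoop's Configuration, and the class name is illustrative (this is not Nutch code): a property set under the documented key is invisible to code reading a different key, which silently falls back to its default.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyMismatch {
    public static void main(String[] args) {
        // A plain map stands in for Hadoop's Configuration.
        Map<String, String> conf = new HashMap<>();

        // The user enables filtering under the key documented in nutch-default.xml...
        conf.put("db.url.filters", "true");

        // ...but the code (CrawlDbFilter) reads a differently named key,
        // so it falls back to the default and ignores the user's setting.
        String value = conf.getOrDefault("crawldb.url.filters", "false");
        System.out.println(value); // prints "false"
    }
}
```

Aligning the property names, as this issue proposes, makes the lookup key and the documented key one and the same.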



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-04-10 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2539.
-
Resolution: Fixed

> Not correct naming of db.url.filters and db.url.normalizers in 
> nutch-default.xml
> 
>
> Key: NUTCH-2539
> URL: https://issues.apache.org/jira/browse/NUTCH-2539
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Major
> Fix For: 1.15
>
>
> There is a mismatch between config and code.
> In the code, CrawlDbFilter lines 41-43:
> > public static final String URL_FILTERING = "crawldb.url.filters";
> > public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> > public static final String URL_NORMALIZING_SCOPE = 
> > "crawldb.url.normalizers.scope";
>  
> In nutch-default.xml
> > <property>
> >   <name>db.url.normalizers</name>
> >   <value>false</value>
> >   <description>Normalize urls when updating crawldb</description>
> > </property>
> >
> > <property>
> >   <name>db.url.filters</name>
> >   <value>false</value>
> >   <description>Filter urls when updating crawldb</description>
> > </property>
> These properties should be in line with code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2550) Fetcher fails to follow redirects

2018-04-10 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2550.
-
Resolution: Fixed

> Fetcher fails to follow redirects
> -
>
> Key: NUTCH-2550
> URL: https://issues.apache.org/jira/browse/NUTCH-2550
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Hans Brende
>Priority: Blocker
> Fix For: 1.15
>
>
> As I detailed in this github 
> [comment|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#r28470348],
>  it appears that PR #221 broke redirects. The fetcher will repeatedly fetch 
> the *original url* rather than the one it's supposed to be redirecting to 
> until {{http.redirect.max}} is exceeded, and then end with 
> {{STATUS_FETCH_GONE}}.
> I noticed this issue when I was trying to crawl a site with a 301 MOVED 
> PERMANENTLY status code.
> Should be pretty easy to fix though: I was able to get redirects working 
> again simply by inserting the code {code:java}url = fit.url{code} 
> [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L388]
>  and 
> [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L409].
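The corrected redirect handling can be sketched as below. All names are hypothetical (this is not the FetcherThread code): the line `url = target` plays the role of the `url = fit.url` fix, updating the loop variable to the redirect target so the next pass fetches the new location instead of the original URL.

```java
import java.util.Map;

public class RedirectSketch {
    // Follow redirects up to a limit; 'max' plays the role of http.redirect.max.
    // Returns the final URL, or null (standing in for STATUS_FETCH_GONE)
    // when the redirect limit is exhausted.
    static String fetch(String url, Map<String, String> redirects) {
        int max = 5;
        for (int i = 0; i <= max; i++) {
            String target = redirects.get(url);
            if (target == null) {
                return url; // no redirect: this URL was fetched successfully
            }
            url = target;   // the crucial update; without it, every pass
                            // re-fetches the original URL (the reported bug)
        }
        return null;        // gave up: redirect limit exceeded
    }

    public static void main(String[] args) {
        Map<String, String> redirects =
            Map.of("http://a.example/", "http://b.example/");
        // A 301 from a.example is followed to b.example.
        System.out.println(fetch("http://a.example/", redirects));
    }
}
```

Without the update, a self-consistent redirect chain degenerates into refetching the same URL until the limit triggers the GONE status, which matches the reported behavior.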



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (NUTCH-2545) Upgrade to Any23 2.2

2018-04-02 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2545.
---

> Upgrade to Any23 2.2
> 
>
> Key: NUTCH-2545
> URL: https://issues.apache.org/jira/browse/NUTCH-2545
> Project: Nutch
>  Issue Type: Improvement
>  Components: any23, plugin
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
>
> We recently released Any23 2.2. I would like to update the Any23 plugin to 
> this newest version.
> PR coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2545) Upgrade to Any23 2.2

2018-04-02 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2545.
-
Resolution: Fixed

> Upgrade to Any23 2.2
> 
>
> Key: NUTCH-2545
> URL: https://issues.apache.org/jira/browse/NUTCH-2545
> Project: Nutch
>  Issue Type: Improvement
>  Components: any23, plugin
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
>
> We recently released Any23 2.2. I would like to update the Any23 plugin to 
> this newest version.
> PR coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2536.
-
Resolution: Fixed

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script; I'm calling the 
> Java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.
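The static-vs-instance distinction can be demonstrated in isolation (hypothetical classes, not the actual GeneratorReducer): a static counter survives across "job" instances in the same JVM, so a second run starts at the previous run's total and hits topN immediately, while an instance counter gives each job a fresh budget.

```java
// Hypothetical illustration (not the actual GeneratorReducer code) of why a
// static per-job counter breaks when several jobs run in the same JVM.
class StaticCountingReducer {
    static long count = 0;            // shared by every instance in this JVM
    final long topN;
    StaticCountingReducer(long topN) { this.topN = topN; }
    boolean accept() {                // true while the batch is under topN
        if (count >= topN) return false;
        count++;
        return true;
    }
}

class InstanceCountingReducer {
    long count = 0;                   // fresh for every job instance
    final long topN;
    InstanceCountingReducer(long topN) { this.topN = topN; }
    boolean accept() {
        if (count >= topN) return false;
        count++;
        return true;
    }
}

public class CounterDemo {
    public static void main(String[] args) {
        // First "job" fills its batch of topN = 3 URLs.
        StaticCountingReducer first = new StaticCountingReducer(3);
        for (int i = 0; i < 3; i++) first.accept();
        // Second "job" in the same JVM should get a fresh budget of 3,
        // but the static counter is already at 3, so it accepts nothing.
        StaticCountingReducer second = new StaticCountingReducer(3);
        System.out.println("static second job accepts: " + second.accept());   // false

        InstanceCountingReducer a = new InstanceCountingReducer(3);
        for (int i = 0; i < 3; i++) a.accept();
        InstanceCountingReducer b = new InstanceCountingReducer(3);
        System.out.println("instance second job accepts: " + b.accept());      // true
    }
}
```

This is exactly the cap-ALL-batches behavior described above; making the field non-static restores the cap-EACH-batch semantics.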



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2545) Upgrade to Any23 2.2

2018-03-27 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2545:
---

 Summary: Upgrade to Any23 2.2
 Key: NUTCH-2545
 URL: https://issues.apache.org/jira/browse/NUTCH-2545
 Project: Nutch
  Issue Type: Improvement
  Components: any23, plugin
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.15


We recently released Any23 2.2. I would like to update the Any23 plugin to this 
newest version.
PR coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work stopped] (NUTCH-2516) Hadoop imports use wildcards

2018-03-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2516 stopped by Lewis John McGibbney.
---
> Hadoop imports use wildcards
> 
>
> Key: NUTCH-2516
> URL: https://issues.apache.org/jira/browse/NUTCH-2516
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Right now the Hadoop imports use wildcards all over the place. 
> We wanted to address this during NUTCH-2375 but didn't get around to it.
> We should address it in a new issue as it is still important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-27 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415429#comment-16415429
 ] 

Lewis John McGibbney commented on NUTCH-2518:
-

Yes, please do [~omkar20895]. Thank you.

> Must check return value of job.waitForCompletion()
> --
>
> Key: NUTCH-2518
> URL: https://issues.apache.org/jira/browse/NUTCH-2518
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher, generator, hostdb, linkdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Kenneth McFarland
>Priority: Blocker
> Fix For: 1.15
>
>
> The return value of job.waitForCompletion() of the new MapReduce API 
> (NUTCH-2375) must always be checked. If it's not true, the job has been 
> failed or killed. Accordingly, the program
> - should not proceed with further jobs/steps
> - must clean-up temporary data, unlock CrawlDB, etc.
> - exit with non-zero exit value, so that scripts running the crawl workflow 
> can handle the failure
> Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR 
> #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].
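A minimal sketch of the requested pattern, using a stand-in Job interface rather than Hadoop's real classes (the names and the 255 exit value are illustrative assumptions, not the actual Nutch patch): treat a false return or an exception as failure, clean up, and return a non-zero code so the crawl script can stop the workflow.

```java
public class JobRunnerSketch {
    // Stand-in for org.apache.hadoop.mapreduce.Job (assumption: same boolean
    // return contract as the real waitForCompletion).
    interface Job {
        boolean waitForCompletion(boolean verbose) throws Exception;
    }

    static int runJob(Job job) {
        boolean success;
        try {
            success = job.waitForCompletion(true);
        } catch (Exception e) {
            success = false;          // a killed/failed job is also a failure
        }
        if (!success) {
            cleanup();                // remove temporary data, unlock the CrawlDB
            return 255;               // non-zero so crawl scripts can abort
        }
        return 0;                     // proceed to the next step only on success
    }

    static void cleanup() {
        // placeholder: delete temp dirs, release the CrawlDB lock, etc.
    }

    public static void main(String[] args) {
        System.out.println(runJob(verbose -> true));  // prints 0
        System.out.println(runJob(verbose -> false)); // prints 255
    }
}
```

The point is that ignoring the boolean lets a failed job fall through to the next step with stale or locked state, which is the risk the three bullet points above guard against.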



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work started] (NUTCH-2516) Hadoop imports use wildcards

2018-03-14 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2516 started by Lewis John McGibbney.
---
> Hadoop imports use wildcards
> 
>
> Key: NUTCH-2516
> URL: https://issues.apache.org/jira/browse/NUTCH-2516
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Right now the Hadoop imports use wildcards all over the place. 
> We wanted to address this during NUTCH-2375 but didn't get around to it.
> We should address it in a new issue as it is still important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398025#comment-16398025
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Correct [~wastl-nagel]

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>    Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshot shows, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> executing mergesegs, resulting in a segment count > 1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2427) Remove all the Hadoop wildcard imports.

2018-03-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2427.
-
Resolution: Duplicate

> Remove all the Hadoop wildcard imports.
> ---
>
> Key: NUTCH-2427
> URL: https://issues.apache.org/jira/browse/NUTCH-2427
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Omkar Reddy
>Priority: Minor
>  Labels: easyfix
>
> This improvement deals with removing the wildcard imports like "import 
> org.apache.hadoop.package.* "



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Upgrade to Hadoop 3

2018-03-13 Thread Lewis John McGibbney
Hi Seb,

On 2018/03/12 11:00:52, Sebastian Nagel  wrote: 
> Hi,
> 
> > seeing as we have just merged in the 'new' MR patch
> 
> yep, but there's still something to do (NUTCH-2517, 

ACK, this needs more testing.

> NUTCH-2518).

I honestly didn't see this come through, but yes, you are right.

> Better to address this before any upgrade of the Hadoop version.

ACK

> But since there seem to be no breaking MapReduce API changes
>   http://hadoop.apache.org/docs/r3.0.0/index.html
> I would even expect that the Nutch job jar (built for 2.7)
> will run on Hadoop 3.0, or does it not?
> 

I have absolutely no idea. I've certainly not had an opportunity to run on a 
Hadoop 3 cluster.


[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397808#comment-16397808
 ] 

Lewis John McGibbney commented on NUTCH-2518:
-

Hi [~wastl-nagel], I think we just overwrote this (as opposed to the commit 
being lost).
I can submit a PR to bring the previous functionality back; however, there is 
some additional work to be done to address the three bullet points you've 
highlighted.

> Must check return value of job.waitForCompletion()
> --
>
> Key: NUTCH-2518
> URL: https://issues.apache.org/jira/browse/NUTCH-2518
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher, generator, hostdb, linkdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> The return value of job.waitForCompletion() of the new MapReduce API 
> (NUTCH-2375) must always be checked. If it's not true, the job has been 
> failed or killed. Accordingly, the program
> - should not proceed with further jobs/steps
> - must clean-up temporary data, unlock CrawlDB, etc.
> - exit with non-zero exit value, so that scripts running the crawl workflow 
> can handle the failure
> Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR 
> #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Upgrade to Hadoop 3

2018-03-13 Thread Lewis John McGibbney
Hi RRK,
Response inline

On 2018/03/08 01:46:18, BlackIce  wrote: 

> 
> Why do you say "Is it too early"? Could you please elaborate on this, thnx.
> 

What I mean is that maybe a lot of people have not upgraded existing 
infrastructure to Hadoop 3 yet. People don't usually move large installations 
for some time... that was all :)
Lewis



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391440#comment-16391440
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Can anyone else confirm the above?

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>    Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshot shows, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> executing mergesegs, resulting in a segment count > 1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391430#comment-16391430
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Hi [~mebbinghaus], I ran it from the Docker container and can reproduce some of 
your results; there is one nuance, however, which I'll explain below.
When I run mergesegs and inspect the data structures created within 
mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. 
So there must be something wrong with your crawl cycle for you to end up with 
only one directory. I'll leave that to you to confirm.

The other issue, however, is that when I attempt to invertlinks using one of 
the merged segments, I end up with the same stack trace as you, so I am looking 
into the code right now.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>    Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch

Upgrade to Hadoop 3

2018-03-07 Thread lewis john mcgibbney
Hi Folks,
Before we get started with GSoC again, and seeing as we have just merged in
the 'new' MR patch, I wonder if folks are partial to migration to Hadoop 3?
Is it too early?
Comments?
Lewis

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388718#comment-16388718
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Should be noted that I didn't run this from the Docker container.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments), which fails with a follow-on error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs 
> command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshot shows, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> running mergesegs, resulting in a segment count > 1.
>  
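The failure mode described above (a merged segment retaining only crawl_generate) can be checked mechanically before running invertlinks. A minimal sketch in Python — not part of Nutch; the segment name is illustrative — that lists which of the six segment parts are missing from a segment directory:

```python
# Hypothetical post-merge sanity check (not Nutch code): report which of the
# six standard segment subdirectories are absent from a segment directory.
import os

SEGMENT_PARTS = ["content", "crawl_generate", "crawl_fetch",
                 "crawl_parse", "parse_data", "parse_text"]

def missing_parts(segment_dir):
    """Return the segment subdirectories that do not exist on disk."""
    return [p for p in SEGMENT_PARTS
            if not os.path.isdir(os.path.join(segment_dir, p))]

# Example: a segment that only kept crawl_generate, as in the bug report.
os.makedirs("demo_segment/crawl_generate", exist_ok=True)
print(missing_parts("demo_segment"))
```

Running such a check right after mergesegs would have flagged the corruption before invertlinks failed.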



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2517:
---

Assignee: Lewis John McGibbney

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>    Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments), which fails with a follow-on error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs 
> command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshot shows, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> running mergesegs, resulting in a segment count > 1.
>  
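The InvalidInputException in the quoted trace is raised by FileInputFormat.listStatus, which validates every configured input path before the job is submitted, so LinkDb fails fast when the merged segment lacks parse_data. A rough Python analogue of that fail-fast check (a sketch, not Hadoop code; the path is taken from the report):

```python
# Sketch of Hadoop's input-path validation: FileInputFormat rejects a job
# during submission if any input path does not exist, before any task runs.
import os

def list_status(input_paths):
    """Mimic FileInputFormat.listStatus: fail fast on missing inputs."""
    missing = [p for p in input_paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(
            "Input path does not exist: " + ", ".join(missing))
    return input_paths

# The merged segment from the report lost its parse_data directory,
# so the LinkDb job is rejected at submission time.
try:
    list_status(["MERGEDsegments/20180304134535/parse_data"])
except FileNotFoundError as err:
    print("job rejected:", err)
```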



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388650#comment-16388650
 ] 

Lewis John McGibbney edited comment on NUTCH-2517 at 3/6/18 11:09 PM:
--

I cannot reproduce this... see below for tests

{code}

//inject

/usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb 
urls/seed.txt
Injector: starting at 2018-03-06 14:31:10
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01
{code}

{code}

//simple 'ls' to see what we have

/usr/local/nutch(master) $ ls mycrawl/crawldb/
current/ old/
{code}

{code}
// generate

/usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb 
mycrawl/segments 1
Generator: starting at 2018-03-06 14:31:37
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mycrawl/segments/20180306143139
Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03
{code}

{code}
//fetch

/usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch 
mycrawl/segments/20180306143139 -threads 2
Fetcher: starting at 2018-03-06 14:32:15
Fetcher: segment: mycrawl/segments/20180306143139
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit :0
FetcherThread 36 Using queue mode : byHost
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02
{code}

{code}
//parse

/usr/local/nutch(master) $ ./runtime/local/bin/nutch parse 
mycrawl/segments/20180306143139 -threads 2
ParseSegment: starting at 2018-03-06 14:32:45
ParseSegment: segment: mycrawl/segments/20180306143139
Parsed (140ms):http://nutch.apache.org:-1/
ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
/usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/
content/  crawl_fetch/  crawl_generate/  crawl_parse/  parse_data/  parse_text/
{code}

{code}
//updatedb

/usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180306143139/
CrawlDb update: starting at 2018-03-06 14:33:40
CrawlDb update: db: mycrawl/crawldb
CrawlDb update: segments: [mycrawl/segments/20180306143139]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01
{code}

{code}
//lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
{code}

{code}
//mergesegs with -dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter
Merging 1 segments to mycrawl/MERGEDsegments/20180306143518
SegmentMerger:   adding file:/usr/local/nutch/mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
MERGEDsegments/  crawldb/  segments/

/usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/crawl_
crawl_generate/ crawl_parse/
{code}

{code}
//mergesegs with single segment directory without dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter
Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617
SegmentMerger:   adding mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// mergesegs with array of segment directories

lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388650#comment-16388650
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

I cannot reproduce this... see below for tests

{code}

//inject

/usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb 
urls/seed.txt
Injector: starting at 2018-03-06 14:31:10
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01
{code}

{code}

//simple 'ls' to see what we have

/usr/local/nutch(master) $ ls mycrawl/crawldb/
current/ old/
{code}

{code}
// generate

/usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb 
mycrawl/segments 1
Generator: starting at 2018-03-06 14:31:37
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mycrawl/segments/20180306143139
Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03
{code}

{code}
//fetch

/usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch 
mycrawl/segments/20180306143139 -threads 2
Fetcher: starting at 2018-03-06 14:32:15
Fetcher: segment: mycrawl/segments/20180306143139
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit :0
FetcherThread 36 Using queue mode : byHost
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02
{code}

{code}
//parse

/usr/local/nutch(master) $ ./runtime/local/bin/nutch parse 
mycrawl/segments/20180306143139 -threads 2
ParseSegment: starting at 2018-03-06 14:32:45
ParseSegment: segment: mycrawl/segments/20180306143139
Parsed (140ms):http://nutch.apache.org:-1/
ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
/usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/
content/  crawl_fetch/  crawl_generate/  crawl_parse/  parse_data/  parse_text/
{code}

{code}
//updatedb

/usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180306143139/
CrawlDb update: starting at 2018-03-06 14:33:40
CrawlDb update: db: mycrawl/crawldb
CrawlDb update: segments: [mycrawl/segments/20180306143139]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01
{code}

{code}
//lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/

{code}

{code}
//mergesegs with -dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter
Merging 1 segments to mycrawl/MERGEDsegments/20180306143518
SegmentMerger:   adding file:/usr/local/nutch/mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
MERGEDsegments/  crawldb/  segments/

/usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/crawl_
crawl_generate/ crawl_parse/
{code}

{code}
//mergesegs with single segment directory without dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter
Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617
SegmentMerger:   adding mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// mergesegs with array of segment directories

lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch 
mergesegs mycrawl/MERGEDsegments3 mycrawl/segments
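The tests above exercise mergesegs with the `-dir` option and with the segments parent directory, while the bug report used a shell glob (`mycrawl/segments/*`). Both invocation styles should enumerate the same segment directories; a small Python illustration of the two expansion styles (directory names are taken from the report and created here only for the demo):

```python
# Compare shell-glob expansion (each segment becomes its own positional
# argument) with -dir style enumeration (list the parent's children).
import glob
import os

os.makedirs("mycrawl/segments/20180304134215", exist_ok=True)
os.makedirs("mycrawl/segments/20180304134535", exist_ok=True)

# Shell-style glob, as in the bug report's mergesegs invocation.
positional = sorted(glob.glob("mycrawl/segments/*"))

# -dir style: enumerate the children of the segments parent directory.
from_dir = sorted(os.path.join("mycrawl/segments", d)
                  for d in os.listdir("mycrawl/segments"))

print(positional == from_dir)  # both styles enumerate the same segments
```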

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386469#comment-16386469
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Thank you [~mebbinghaus] for reporting. This appears to be a major bug and 
hence a blocker for the next release. I will begin work on a solution ASAP.
FYI [~omkar20895] this is post Hadoop upgrade.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as 
> described above. As the attached 
> screenshot shows, the problem also appears when executing multiple bin/nutch 
> generate/fetch/parse/updatedb rounds before executing mergesegs, resulting 
> in a segment count > 1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-05 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2517:

Priority: Blocker  (was: Major)



[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-04 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2517:

Fix Version/s: 1.15



Re: Apache Nutch - Exception since last commit

2018-03-03 Thread lewis john mcgibbney
Hello Marco,
Thank you very much for the information. Please register the issue on jira.
I will personally look into it and make best efforts to fix the bug if one
exists.
Please provide details as to how I can reproduce.
Thanks,
Lewis

On Sat, Mar 3, 2018 at 09:14 Marco Ebbinghaus 
wrote:

> Hello Lewis,
>
> I just wanted to let you know, that I am experiencing problems with
> nutch since your last merge 4 days ago. I am using the latest-tagged
> docker image version of apache/nutch.
>
> On my live system (which is some weeks old), everything works fine. But
> since the last local image-repull today I cannot get nutch working. I am
> using a script which runs inject, generate, fetch, parse, updatedb,
> mergesegs, invertlinks, index cycles.
>
> Everything works fine until the merging / invertlinks steps. I have three
> segment folders with all required subfolders. But it seems like the
> merging of the three segments isn't done correctly, so the merged
> segment folder is not complete. I can absolutely reproduce this.
>
> I haven't further investigated the problem and I have no more time
> today. But I wanted to inform you already. Maybe you have an idea. Maybe
> I will have some time to further investigate the problem tomorrow.
>
> Here are the commands that are executed:
>
>
>  $NUTCH_HOME/bin/nutch mergesegs
> $NUTCH_HOME/$crawlDirName/MERGEDsegments
> $NUTCH_HOME/$crawlDirName/segments/* -filter
>
>  rm $RMARGS $NUTCH_HOME/$crawlDirName/segments
>
>  mv $MVARGS $NUTCH_HOME/$crawlDirName/MERGEDsegments
> $NUTCH_HOME/$crawlDirName/segments
>
> $NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/$crawlDirName/linkdb -dir
> $NUTCH_HOME/$crawlDirName/segments
>
>
> and I will attach a screenshot with the stacktrace.
>
>
> Greetings,
>
>
> Marco Ebbinghaus
>
> --
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Updated] (NUTCH-2516) Hadoop imports use wildcards

2018-02-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2516:

Description: 
Right now the Hadoop imports use wildcards all over the place. 
We wanted to address this during NUTCH-2375 but didn't get around to it.
We should address it in a new issue as it is still important.

  was:
Right now the Hadoop imports use wildcards all over the place. 
We wanted to address this during NUTCH-2371 but didn't get around to it.
We should address it in a new issue as it is still important.


> Hadoop imports use wildcards
> 
>
> Key: NUTCH-2516
> URL: https://issues.apache.org/jira/browse/NUTCH-2516
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Right now the Hadoop imports use wildcards all over the place. 
> We wanted to address this during NUTCH-2375 but didn't get around to it.
> We should address it in a new issue as it is still important.





[jira] [Created] (NUTCH-2516) Hadoop imports use wildcards

2018-02-27 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2516:
---

 Summary: Hadoop imports use wildcards
 Key: NUTCH-2516
 URL: https://issues.apache.org/jira/browse/NUTCH-2516
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.14
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.15


Right now the Hadoop imports use wildcards all over the place. 
We wanted to address this during NUTCH-2371 but didn't get around to it.
We should address it in a new issue as it is still important.



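The cleanup itself is mechanical: replace each wildcard with one import per class actually used, so a file's dependencies are explicit and classes with the same simple name in different packages cannot silently collide. A hypothetical illustration using java.util (not actual Nutch source, where the imports in question are Hadoop classes):

```java
// Before: import java.util.*;  (pulls in every class in the package)
// After: each dependency is named explicitly.
import java.util.ArrayList;
import java.util.List;

public class ExplicitImports {
    // Uses only the two explicitly imported types.
    static List<String> segmentParts() {
        List<String> parts = new ArrayList<>();
        parts.add("crawl_generate");
        parts.add("crawl_fetch");
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(segmentParts());
    }
}
```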


[jira] [Assigned] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2018-02-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2375:
---

Assignee: Lewis John McGibbney

> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> --
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
>  Issue Type: Improvement
>  Components: deployment
>Affects Versions: 1.13
>Reporter: Omkar Reddy
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Nutch is still using the deprecated org.apache.hadoop.mapred API. It needs 
> to be updated to the org.apache.hadoop.mapreduce 
> API. 





[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-02-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373173#comment-16373173
 ] 

Lewis John McGibbney commented on NUTCH-2512:
-

Hi [~Bl4ck1c3] thanks for logging the issue... this behavior was to be 
'expected'. As you can see from the 
[javac.version|https://github.com/apache/nutch/blob/release-1.14/default.properties#L60]
 we had it pinned at 1.8.
I suppose we can make the upgrade for 1.15... (cue patch ;))

> Nutch 1.14 does not work under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as off 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> Once having commented out the "offending code" the Build finishes but the 
> resulting Binary fails to function (as well as the Apache Compiled Binary 
> distribution), Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.





[jira] [Updated] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2512:

Fix Version/s: 1.15



[jira] [Resolved] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2018-02-07 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2489.
-
Resolution: Fixed

Thank you [~yossi]

> Dependency collision with lucene-analyzers-common in scoring-similarity plugin
> --
>
> Key: NUTCH-2489
> URL: https://issues.apache.org/jira/browse/NUTCH-2489
> Project: Nutch
>  Issue Type: Bug
>  Components: scoring
>Affects Versions: 1.14
>Reporter: Yossi Tamari
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml.patch
>
>
> After updating to Master branch of 1.14, we get a few compile errors in 
> LuceneTokenizer.java and LuceneAnalyzerUtil.java:
> {code:java}
> Type mismatch: cannot convert from org.apache.lucene.analysis.CharArraySet to 
> org.apache.lucene.analysis.util.CharArraySet
> {code}
> This seems to be caused by the fact that scoring-similarity compiles with 
> lucene-analyzers-common-5.5.0.jar (from ivy.xml), but with lucene-core-6.4.1 
> instead of the matching 5.5.0.



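The attached ivy.xml.patch aligns the two Lucene modules on a single version. An illustrative ivy.xml fragment — the revision shown here is an assumption for illustration; the actual patch pins whatever version the build standardizes on:

```xml
<!-- lucene-core and lucene-analyzers-common must share a revision so that
     classes such as CharArraySet resolve consistently. -->
<dependency org="org.apache.lucene" name="lucene-core" rev="6.4.1"/>
<dependency org="org.apache.lucene" name="lucene-analyzers-common" rev="6.4.1"/>
```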


[jira] [Updated] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2018-02-07 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2489:

Fix Version/s: 1.15



[jira] [Resolved] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2508.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Misleading documentation about http.proxy.exception.list
> 
>
> Key: NUTCH-2508
> URL: https://issues.apache.org/jira/browse/NUTCH-2508
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> The description about {{http.proxy.exception.list}} states that domains as 
> well as URLs can be configured to be excluded from being routed through a 
> pre-configured proxy. This is misleading since only hosts are being checked 
> when using this feature.



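To make the actual behavior concrete: only the host component of each request URL is compared against the list, so entries that are full URLs or path patterns never match. A minimal sketch of host-only matching — hypothetical helper and host names, not the actual Nutch implementation:

```java
import java.net.URI;
import java.util.Set;

public class ProxyExceptionCheck {
    // Hosts excluded from the proxy, as parsed from http.proxy.exception.list
    // (example hosts; the real list comes from configuration).
    static final Set<String> EXCEPTIONS = Set.of("intranet.example.com", "localhost");

    // Only the host component of the URL is checked; paths, schemes and
    // full URLs in the exception list would never match anything.
    static boolean bypassProxy(String url) {
        String host = URI.create(url).getHost();
        return host != null && EXCEPTIONS.contains(host.toLowerCase());
    }
}
```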


[jira] [Updated] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2508:

Fix Version/s: 1.15



[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2018-01-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341326#comment-16341326
 ] 

Lewis John McGibbney commented on NUTCH-2369:
-

Hi [~markus17] the idea here was to export full graph information into 
something that could be interpreted by [Tinkerpop|http://tinkerpop.apache.org] 
and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html].

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: gsoc2017, gsoc2018
> Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using Tinkerpop's ScriptInputFormat and 
> ScriptOutputFormat's to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, LinkDB, a 
> Segment and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2369:

Labels: gsoc2017 gsoc2018  (was: gsoc2017)

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: gsoc2017, gsoc2018
> Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using Tinkerpop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, LinkDB, a 
> Segment and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2502.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.
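A sketch of the kind of Content-Type allowlist check this asks for. The comma-separated configuration format and class name are assumptions, not the plugin's actual configuration; they only illustrate filtering on the media type before running extractors.

```java
import java.util.*;

// Hypothetical Content-Type filter: extraction runs only when the document's
// media type is in a configured allowlist.
class ContentTypeFilter {
    private final Set<String> allowed;

    ContentTypeFilter(String commaSeparated) {
        // Assumed config format: "text/html, application/xhtml+xml"
        this.allowed = new HashSet<>(
            Arrays.asList(commaSeparated.toLowerCase().split("\\s*,\\s*")));
    }

    boolean accepts(String contentTypeHeader) {
        if (contentTypeHeader == null) return false;
        // Strip parameters such as "; charset=utf-8" before comparing.
        String mediaType = contentTypeHeader.split(";")[0].trim().toLowerCase();
        return allowed.contains(mediaType);
    }
}
```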



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2502:

Fix Version/s: 1.15

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2499:

Fix Version/s: 1.15

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.
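The set-based behavior the issue says was lost can be sketched in one line: collapse duplicate field values while preserving first-seen order. This is an illustration of the desired semantics, not the indexer's actual code.

```java
import java.util.*;

class DedupValues {
    // LinkedHashSet drops repeats but keeps insertion order, so a field
    // indexed with ["a", "b", "a"] is stored as ["a", "b"] rather than an
    // array containing the duplicate.
    static List<String> dedup(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }
}
```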



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2499.
-
Resolution: Fixed

Thank you [~mfeltscher]

 

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2503.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2503:

Fix Version/s: 1.15

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts

2018-01-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2497.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Elastic REST Indexer: Allow multiple hosts
> --
>
> Key: NUTCH-2497
> URL: https://issues.apache.org/jira/browse/NUTCH-2497
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Allow specifying a list of Elasticsearch hosts to index documents to. This 
> would be especially helpful when working with an Elasticsearch cluster which 
> consists of multiple nodes.
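One way a multi-host setting could be expressed is a comma-separated list of host:port pairs; a sketch of parsing such a value follows. The format and names are assumptions for illustration, not the indexer's actual property syntax.

```java
import java.util.*;

// Hypothetical parser for a setting like "es1.example:9200, es2.example",
// where hosts without an explicit port fall back to a default.
class HostListParser {
    record HostPort(String host, int port) {}

    static List<HostPort> parse(String spec, int defaultPort) {
        List<HostPort> out = new ArrayList<>();
        for (String part : spec.split("\\s*,\\s*")) {
            String[] hp = part.split(":", 2);
            int port = hp.length == 2 ? Integer.parseInt(hp[1]) : defaultPort;
            out.add(new HostPort(hp[0], port));
        }
        return out;
    }
}
```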



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts

2018-01-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2497:

Fix Version/s: 1.15

> Elastic REST Indexer: Allow multiple hosts
> --
>
> Key: NUTCH-2497
> URL: https://issues.apache.org/jira/browse/NUTCH-2497
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Allow specifying a list of Elasticsearch hosts to index documents to. This 
> would be especially helpful when working with an Elasticsearch cluster which 
> consists of multiple nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2461) Generate passes the data to when maxCount == 0

2018-01-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2461.
-
Resolution: Fixed

Thank you [~semyon.semyo...@mail.com]

> Generate passes the data to when maxCount  == 0
> ---
>
> Key: NUTCH-2461
> URL: https://issues.apache.org/jira/browse/NUTCH-2461
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
>Priority: Critical
> Fix For: 1.15
>
>
> The generator checks the condition 
> if (maxCount > 0) at line 421 and stops generation when the amount per host 
> exceeds maxCount (continue, line 455), 
> but when maxCount == 0 it goes directly to line 465: output.collect(key, 
> entry);
> This is obviously not correct; the correct solution would be to add 
> if (maxCount == 0) {
>   continue;
> }
> at line 380.
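The per-host limit and the proposed fix can be condensed into a self-contained sketch (the real Generator is a MapReduce job; this only models the counting logic, with all input URLs assumed to share one host):

```java
import java.util.*;

class PerHostLimit {
    // Select at most maxCount URLs for a host. With the proposed fix,
    // maxCount == 0 emits nothing instead of bypassing the limit entirely;
    // a negative maxCount means unlimited.
    static List<String> select(List<String> urlsByHost, int maxCount) {
        List<String> selected = new ArrayList<>();
        int hostCount = 0;
        for (String url : urlsByHost) {
            if (maxCount == 0) continue;                          // the proposed fix
            if (maxCount > 0 && hostCount >= maxCount) continue;  // existing limit
            hostCount++;
            selected.add(url);
        }
        return selected;
    }
}
```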



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2321) Indexing filter checker leaks threads

2018-01-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326453#comment-16326453
 ] 

Lewis John McGibbney commented on NUTCH-2321:
-

Thank you [~jurian]

> Indexing filter checker leaks threads
> -
>
> Key: NUTCH-2321
> URL: https://issues.apache.org/jira/browse/NUTCH-2321
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2321.patch
>
>
> Same issue as NUTCH-2320.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2321) Indexing filter checker leaks threads

2018-01-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2321.
-
Resolution: Fixed

> Indexing filter checker leaks threads
> -
>
> Key: NUTCH-2321
> URL: https://issues.apache.org/jira/browse/NUTCH-2321
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2321.patch
>
>
> Same issue as NUTCH-2320.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1129:

Fix Version/s: (was: 2.5)
   1.15

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1129.
-
Resolution: Fixed

Thank you [~mfeltscher] this is great

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script

2018-01-10 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2493.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Add configuration parameter for sitemap processing to crawler script
> 
>
> Key: NUTCH-2493
> URL: https://issues.apache.org/jira/browse/NUTCH-2493
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
> Fix For: 1.15
>
>
> While using the crawler script with the sitemap processing feature introduced 
> in NUTCH-2491 I encountered some performance issues when working with large 
> sitemaps.
> Therefore one should be able to specify whether sitemap processing based on 
> HostDB should take place and, if so, how frequently it should be done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script

2018-01-10 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2493:

Fix Version/s: 1.15

> Add configuration parameter for sitemap processing to crawler script
> 
>
> Key: NUTCH-2493
> URL: https://issues.apache.org/jira/browse/NUTCH-2493
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
> Fix For: 1.15
>
>
> While using the crawler script with the sitemap processing feature introduced 
> in NUTCH-2491 I encountered some performance issues when working with large 
> sitemaps.
> Therefore one should be able to specify whether sitemap processing based on 
> HostDB should take place and, if so, how frequently it should be done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2324) Issue in setting default linkdb path

2018-01-09 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2324.
-
Resolution: Fixed

Thank you [~sachin]

> Issue in setting default linkdb path 
> -
>
> Key: NUTCH-2324
> URL: https://issues.apache.org/jira/browse/NUTCH-2324
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.12
>Reporter: Sachin
>Priority: Minor
> Fix For: 1.15
>
>
> There is an extra if condition that prevents setting the default linkdb path 
> if one is not provided in the REST call.
>  
> Check this : 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexingJob.java#L272
> https://github.com/apache/nutch/pull/153
> PS : Don't know whether it is intentional. You may check!
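The intended behavior (fall back to a default when the REST call omits the path) can be sketched as follows; the method and parameter names are illustrative, not the IndexingJob API.

```java
// Hypothetical resolver: use the caller-supplied LinkDb path when present,
// otherwise fall back to a default location under the crawl directory.
class LinkDbPathResolver {
    static String resolve(String provided, String crawlDir) {
        return (provided == null || provided.isEmpty())
            ? crawlDir + "/linkdb"   // assumed default location
            : provided;
    }
}
```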



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2324) Issue in setting default linkdb path

2018-01-09 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2324:

Fix Version/s: 1.15

> Issue in setting default linkdb path 
> -
>
> Key: NUTCH-2324
> URL: https://issues.apache.org/jira/browse/NUTCH-2324
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.12
>Reporter: Sachin
>Priority: Minor
> Fix For: 1.15
>
>
> There is an extra if condition that prevents setting the default linkdb path 
> if one is not provided in the REST call.
>  
> Check this : 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexingJob.java#L272
> https://github.com/apache/nutch/pull/153
> PS : Don't know whether it is intentional. You may check!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2492) Add more configuration parameters to crawl script

2018-01-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2492.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Add more configuration parameters to crawl script 
> --
>
> Key: NUTCH-2492
> URL: https://issues.apache.org/jira/browse/NUTCH-2492
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
> Fix For: 1.15
>
>
> Instead of having to copy and adjust the crawl script in order to specify the 
> following configuration options, allow the user to pass them in as 
> arguments:
> - numSlaves
> - numTasks
> - sizeFetchlist
> - timeLimitFetch
> - numThreads



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

