[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
[ https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155635#comment-17155635 ] Lewis John McGibbney commented on NUTCH-2802: - [~snagel] thanks for opening this one. I'll go ahead and create a PR shortly. > Replace blacklist/whitelist by more inclusive and precise terminology > - > > Key: NUTCH-2802 > URL: https://issues.apache.org/jira/browse/NUTCH-2802 > Project: Nutch > Issue Type: Improvement > Components: configuration, plugin > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.18 > > > The terms blacklist and whitelist should be replaced by a more inclusive and > more precise terminology, see the proposal and discussion on the @dev mailing > list > ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], > > [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]). > This is an umbrella issue, subtasks to be opened for individual plugins and > configuration properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
[ https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2802: --- Assignee: Lewis John McGibbney > Replace blacklist/whitelist by more inclusive and precise terminology > - > > Key: NUTCH-2802 > URL: https://issues.apache.org/jira/browse/NUTCH-2802 > Project: Nutch > Issue Type: Improvement > Components: configuration, plugin > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.18 > > > The terms blacklist and whitelist should be replaced by a more inclusive and > more precise terminology, see the proposal and discussion on the @dev mailing > list > ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], > > [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]). > This is an umbrella issue, subtasks to be opened for individual plugins and > configuration properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[PROPOSAL] Replace whitelist blacklist with allowlist denylist
Hi Folks,

*What*
I would like to propose that we replace source code containing 'whiteList' and 'blackList'-esque terms/phrases with some more representative language, e.g. allowList, denyList.

*Where*
* subcollection plugin - https://github.com/apache/nutch/blob/master/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java#L46-L47
* urlfilter-domainblacklist plugin - https://github.com/apache/nutch/tree/master/src/plugin/urlfilter-domainblacklist

*Why*
I think we could and should use more neutral terminology and lead by example. I want to STRESS that this proposal is by no means an effort by me to reflect negatively on the authors or their EXCELLENT contributions to Nutch. I hope this is taken in good faith and that we as a community can come together on this one.

*How*
Please voice your opinions here and we can take it from there. I would personally love to hear all opinions and I will personally take any action(s) if we decide to go forward with the proposal.

Thank you for your consideration folks.

Lewis
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
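As a concrete illustration of the proposed renaming, a minimal sketch might look like the following. The class and member names here (SubcollectionSketch, allowList, denyList, filter) are hypothetical, not the actual Subcollection API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the fields formerly named "whiteList"/"blackList"
// renamed to "allowList"/"denyList" with no behavioral change.
public class SubcollectionSketch {
    private final List<String> allowList;
    private final List<String> denyList;

    public SubcollectionSketch(List<String> allowList, List<String> denyList) {
        this.allowList = allowList;
        this.denyList = denyList;
    }

    /** Returns true if the URL matches an allow prefix and no deny prefix. */
    public boolean filter(String url) {
        boolean allowed = allowList.stream().anyMatch(url::startsWith);
        boolean denied = denyList.stream().anyMatch(url::startsWith);
        return allowed && !denied;
    }

    public static void main(String[] args) {
        SubcollectionSketch s = new SubcollectionSketch(
            Arrays.asList("https://nutch.apache.org/"),
            Arrays.asList("https://nutch.apache.org/private/"));
        System.out.println(s.filter("https://nutch.apache.org/docs.html"));
        System.out.println(s.filter("https://nutch.apache.org/private/x"));
    }
}
```

Since the semantics are unchanged, a rename of this kind is purely mechanical apart from deprecating any public configuration property names that embed the old terms.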
Interest in static source code analysis with sonarcloud.io
Hi dev@,

I posted on this topic previously but cannot find the thread. Well, it turns out that we have made a slight bit of progress. See https://issues.apache.org/jira/browse/INFRA-19474 for context.

Is anyone else registered on sonarcloud.io? If so, can you please update INFRA-19474 as follows: https://s.apache.org/oecsn

Thank you
Lewis
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
[jira] [Assigned] (NUTCH-1863) Add JSON format dump output to readdb command
[ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1863: --- Assignee: Shashanka Balakuntala Srinivasa > Add JSON format dump output to readdb command > - > > Key: NUTCH-1863 > URL: https://issues.apache.org/jira/browse/NUTCH-1863 > Project: Nutch > Issue Type: New Feature > Components: crawldb >Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney >Assignee: Shashanka Balakuntala Srinivasa >Priority: Major > > Opening up the ability for third parties to consume Nutch crawldb data as > JSON would be a positive thing IMHO. > This issue should improve the readdb functionality of 1.X to enable JSON > dumps of crawldb data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-1863) Add JSON format dump output to readdb command
[ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987208#comment-16987208 ] Lewis John McGibbney commented on NUTCH-1863: - +1, please go ahead [~balaShashanka] > Add JSON format dump output to readdb command > - > > Key: NUTCH-1863 > URL: https://issues.apache.org/jira/browse/NUTCH-1863 > Project: Nutch > Issue Type: New Feature > Components: crawldb >Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney >Priority: Major > > Opening up the ability for third parties to consume Nutch crawldb data as > JSON would be a positive thing IMHO. > This issue should improve the readdb functionality of both 1.X and 2.X to > enable JSON dumps of crawldb data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-1863) Add JSON format dump output to readdb command
[ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1863: Description: Opening up the ability for third parties to consume Nutch crawldb data as JSON would be a positive thing IMHO. This issue should improve the readdb functionality of 1.X to enable JSON dumps of crawldb data. was: Opening up the ability for third parties to consume Nutch crawldb data as JSON would be a positive thing IMHO. This issue should improve the readdb functionality of both 1.X and 2.X to enable JSON dumps of crawldb data. > Add JSON format dump output to readdb command > - > > Key: NUTCH-1863 > URL: https://issues.apache.org/jira/browse/NUTCH-1863 > Project: Nutch > Issue Type: New Feature > Components: crawldb >Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney >Priority: Major > > Opening up the ability for third parties to consume Nutch crawldb data as > JSON would be a positive thing IMHO. > This issue should improve the readdb functionality of 1.X to enable JSON > dumps of crawldb data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
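For illustration, one plausible shape for a JSON line per crawldb entry is sketched below. The field names (url, status, fetchTime, score) are assumptions about what such a dump might contain, not the committed output format:

```java
import java.util.Locale;

// Sketch of emitting one crawldb record as a single JSON line.
// Field names are illustrative assumptions, not the actual readdb format.
public class CrawlDbJsonSketch {
    static String toJson(String url, String status, long fetchTime, float score) {
        // Locale.ROOT keeps the decimal separator a '.' regardless of platform locale.
        return String.format(Locale.ROOT,
            "{\"url\":\"%s\",\"status\":\"%s\",\"fetchTime\":%d,\"score\":%.2f}",
            url, status, fetchTime, score);
    }

    public static void main(String[] args) {
        System.out.println(
            toJson("https://nutch.apache.org/", "db_fetched", 1575158400L, 1.0f));
    }
}
```

A newline-delimited JSON layout like this is easy for third parties to consume with standard streaming tools, which is the motivation stated in the issue.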
Static source code analysis via sonarcloud.io
Hi dev@,

Quick heads up, I am working on sonarcloud.io analysis for the Nutch master branch. The reasoning being that I did this previously whilst we hosted SonarQube internally at Apache... but didn't really do anything about it. This is a renewed attempt to study the improvements which can be made to the Nutch source code. I'll update once I have news.

Best
Lewis
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
[jira] [Comment Edited] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin
[ https://issues.apache.org/jira/browse/NUTCH-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963221#comment-16963221 ] Lewis John McGibbney edited comment on NUTCH-2677 at 10/30/19 4:58 PM: --- [~balaShashanka] bq. can i work on this issue? yes bq. Or is anybody working on this already? No, however please see NUTCH-2739 which supersedes this issue. Hopefully the description provides enough information. was (Author: lewismc): [~balaShashanka] bq. can i work on this issue? yes bq. Or is anybody working on this already? No Hopefully the description provides enough information. > Update Jest client in indexer-elastic-rest plugin > - > > Key: NUTCH-2677 > URL: https://issues.apache.org/jira/browse/NUTCH-2677 > Project: Nutch > Issue Type: Task > Components: indexer, plugin >Affects Versions: 1.15 > Reporter: Lewis John McGibbney >Priority: Major > Fix For: 1.17 > > > We should really upgrade the dependency to a more recent version > https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml > We are using 2.0.1, the most recent is 6.3.1 > https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin
[ https://issues.apache.org/jira/browse/NUTCH-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963221#comment-16963221 ] Lewis John McGibbney commented on NUTCH-2677: - [~balaShashanka] bq. can i work on this issue? yes bq. Or is anybody working on this already? No Hopefully the description provides enough information. > Update Jest client in indexer-elastic-rest plugin > - > > Key: NUTCH-2677 > URL: https://issues.apache.org/jira/browse/NUTCH-2677 > Project: Nutch > Issue Type: Task > Components: indexer, plugin >Affects Versions: 1.15 > Reporter: Lewis John McGibbney >Priority: Major > Fix For: 1.17 > > > We should really upgrade the dependency to a more recent version > https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml > We are using 2.0.1, the most recent is 6.3.1 > https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809
Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)
Disclosure date: 2018-10-22
Credit: Pierre Ernst, Salesforce

Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling a web site containing malicious content

Description: The reporter found an RCE security vulnerability in Nutch 2.3.1 when crawling a web site that links a doctored Matlab file. This was due to unsafe deserialization of user-generated content. The root cause is 2 outdated 3rd-party dependencies:
1. Apache Tika version 1.10 (CVE-2016-6809)
2. Apache Commons Collections 4 version 4.0 (COLLECTIONS-580)
Upgrading these 2 dependencies to the latest version will fix the issue.

Resolution: The Apache Nutch Project Management Committee released Apache Nutch 2.4 on 2019-10-11 (https://s.apache.org/uw8i3). All users of the 2.X branch should upgrade to this version immediately. In addition, note that we expect v2.4 to be the last release in the 2.x series. The Nutch PMC decided to freeze development on the 2.x branch for now, as no committers are actively working on it. See the above hyperlink for more information on upgrading and the 2.x retirement decision.

Contact: either dev[at] or private[at]nutch[dot]apache[dot]org depending on the nature of your contact.

Regards
lewismc
(On behalf of the Apache Nutch PMC)
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
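For users who must stay on 2.3.1 temporarily, the dependency bump described above can be sketched as an ivy.xml fragment. This is illustrative only: the exact patched revision numbers below are assumptions and should be checked against the current advisories before pinning.

```xml
<!-- Illustrative only: pin the two affected dependencies to patched releases.
     CVE-2016-6809 was fixed in later Tika releases and COLLECTIONS-580 in
     later commons-collections4 releases; verify current versions before use. -->
<dependency org="org.apache.tika" name="tika-core" rev="1.22" conf="*->default"/>
<dependency org="org.apache.commons" name="commons-collections4" rev="4.4" conf="*->default"/>
```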
[jira] [Work stopped] (NUTCH-2307) Implement Missing NutchServer REST API Tests
[ https://issues.apache.org/jira/browse/NUTCH-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2307 stopped by Lewis John McGibbney. --- > Implement Missing NutchServer REST API Tests > > > Key: NUTCH-2307 > URL: https://issues.apache.org/jira/browse/NUTCH-2307 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Affects Versions: 2.3.1 >Reporter: Furkan Kamaci > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 2.5 > > > TestAPI.java was all commented out. The reason was indicated as: > {quote} > CURRENTLY DISABLED. TESTS ARE FLAPPING FOR NO APPARENT REASON. > SHALL BE FIXED OR REPLACES BY NEW API IMPLEMENTATION > {quote} > So, we should implement those missing tests based on the new > AbstractNutchAPITestBase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (NUTCH-2307) Implement Missing NutchServer REST API Tests
[ https://issues.apache.org/jira/browse/NUTCH-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2307: --- Assignee: Lewis John McGibbney > Implement Missing NutchServer REST API Tests > > > Key: NUTCH-2307 > URL: https://issues.apache.org/jira/browse/NUTCH-2307 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Affects Versions: 2.3.1 >Reporter: Furkan Kamaci > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 2.5 > > > TestAPI.java was all commented out. The reason was indicated as: > {quote} > CURRENTLY DISABLED. TESTS ARE FLAPPING FOR NO APPARENT REASON. > SHALL BE FIXED OR REPLACES BY NEW API IMPLEMENTATION > {quote} > So, we should implement those missing tests based on the new > AbstractNutchAPITestBase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work stopped] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1709 stopped by Lewis John McGibbney. --- > Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain > methods not defined in source .avsc > - > > Key: NUTCH-1709 > URL: https://issues.apache.org/jira/browse/NUTCH-1709 > Project: Nutch > Issue Type: Improvement > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 2.5 > > Attachments: NUTCH-1709.patch > > > When using the GoraCompiler currently packaged with gora-core-0.4-SNAPSHOT, > the following methods are removed from o.a.n.storage.Host or > o.a.n.storage.ProtocolStatus > {code:title=Host.java|borderStyle=solid} > public boolean contains(String key) { > return metadata.containsKey(new Utf8(key)); > } > > public String getValue(String key, String defaultValue) { > if (!contains(key)) return defaultValue; > return Bytes.toString(metadata.get(new Utf8(key))); > } > > public int getInt(String key, int defaultValue) { > if (!contains(key)) return defaultValue; > return Integer.parseInt(getValue(key,null)); > } > public long getLong(String key, long defaultValue) { > if (!contains(key)) return defaultValue; > return Long.parseLong(getValue(key,null)); > } > {code} > {code:title=ProtocolStatus.java|borderStyle=solid} > /** >* A convenience method which returns a successful {@link ProtocolStatus}. >* @return the {@link ProtocolStatus} value for 200 (success). >*/ > public boolean isSuccess() { > return code == ProtocolStatusUtils.SUCCESS; > } > {code} > This results in compilation errors... I am not sure if it is good practice > for non-default methods to be contained within generated Persistent classes. > This is certainly the case with newer versions of Avro when using the Java > API. 
> compile-core: > [javac] Compiling 104 source files to > /home/mary/Downloads/apache/2.x/build/classes > [javac] warning: [options] bootstrap class path not set in conjunction > with -source 1.6 > [javac] > /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:345: > error: cannot find symbol > [javac]host.getInt("q_mt", > maxThreads), > [javac]^ > [javac] symbol: method getInt(String,int) > [javac] location: variable host of type Host > [javac] > /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:346: > error: cannot find symbol > [javac]host.getLong("q_cd", > crawlDelay), > [javac]^ > [javac] symbol: method getLong(String,long) > [javac] location: variable host of type Host > [javac] > /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:347: > error: cannot find symbol > [javac]host.getLong("q_mcd", > minCrawlDelay)); > [javac]^ > [javac] symbol: method getLong(String,long) > [javac] location: variable host of type Host > [javac] > /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/parse/ParserChecker.java:114: > error: cannot find symbol > [javac] if(!protocolOutput.getStatus().isSuccess()) { > [javac] ^ > [javac] symbol: method isSuccess() > [javac] location: class ProtocolStatus > [javac] Note: > /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/storage/Host.java > uses unchecked or unsafe operations. > [javac] Note: Recompile with -Xlint:unchecked for details. > [javac] 4 errors > [javac] 1 warning > I think it would be a good idea to find another home for such methods as it > will undoubtedly avoid problems when we do Gora upgrades in the future. > Right now I don't have a suggestion but will work on a solution nonetheless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
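One possible "other home" for such convenience methods, as the issue suggests, is a static utility class so the generated Persistent classes stay purely generated. The sketch below is an assumption: the class name HostUtils and the plain Map signatures are hypothetical, whereas the real Nutch code keys metadata with Avro Utf8 and byte buffers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper keeping convenience lookups out of generated classes,
// so a Gora/Avro compiler re-run cannot silently drop them.
public class HostUtils {

    public static String getValue(Map<String, String> metadata,
                                  String key, String defaultValue) {
        String value = metadata.get(key);
        return value != null ? value : defaultValue;
    }

    public static int getInt(Map<String, String> metadata,
                             String key, int defaultValue) {
        String value = metadata.get(key);
        return value != null ? Integer.parseInt(value) : defaultValue;
    }

    public static long getLong(Map<String, String> metadata,
                               String key, long defaultValue) {
        String value = metadata.get(key);
        return value != null ? Long.parseLong(value) : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> md = new HashMap<>();
        md.put("q_mt", "10");
        System.out.println(getInt(md, "q_mt", 5));   // key present
        System.out.println(getLong(md, "q_cd", 3L)); // falls back to default
    }
}
```

Call sites such as FetcherReducer would then invoke HostUtils.getInt(host.getMetadata(), "q_mt", maxThreads) instead of a method on the generated Host class, which survives regeneration by design.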
[jira] [Updated] (NUTCH-2722) Fetch dependencies via https
[ https://issues.apache.org/jira/browse/NUTCH-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2722: Fix Version/s: (was: 2.5) > Fetch dependencies via https > > > Key: NUTCH-2722 > URL: https://issues.apache.org/jira/browse/NUTCH-2722 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 2.4, 2.5, 1.16 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 2.4, 1.16 > > > Dependencies need to be fetched via https, see > https://central.sonatype.org/articles/2019/Apr/30/http-access-to-repo1mavenorg-and-repomavenapacheorg-is-being-deprecated/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] Release Apache Nutch 1.16 RC#1
Hi Seb,

Sigs check out fine

gpg --verify apache-nutch-1.16-src.tar.gz.asc apache-nutch-1.16-src.tar.gz
gpg: Signature made Wed Oct 2 08:07:47 2019 PDT
gpg: using RSA key FF82A487F92D70E52FF77E0AC66EA7B7DB0A9C6D
gpg: Good signature from "Sebastian Nagel " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: FF82 A487 F92D 70E5 2FF7 7E0A C66E A7B7 DB0A 9C6D

sha512sum --check apache-nutch-1.16-src.tar.gz.sha512
apache-nutch-1.16-src.tar.gz: OK

Tests pass successfully

All top level files are fine in terms of dates and licenses.

[X] +1 Release this package as Apache Nutch 1.16.

On 2019/10/02 17:54:59, Sebastian Nagel wrote:
> Hi Folks,
>
> A first candidate for the Nutch 1.16 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.16/
>
> The release candidate is a zip and tar.gz archive of the binary and sources in:
>   https://github.com/apache/nutch/tree/release-1.16
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachenutch-1017/
>
> We addressed 104 Issues:
>   https://s.apache.org/l2j94
>
> Please vote on releasing this package as Apache Nutch 1.16.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.16.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
[jira] [Comment Edited] (NUTCH-2669) Reliable solution for javax.ws packaging.type
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792267#comment-16792267 ] Lewis John McGibbney edited comment on NUTCH-2669 at 3/14/19 2:32 AM: -- [~wastl-nagel] this has become a blocker issue whilst attempting to roll the 2.4 release candidate. I've tried using multiple combinations of proposed fixes but I cannot get Nutch branch-2.4 to build from source any more. {code} ant clean test ... ... ... resolve-default: [ivy:resolve] :: loading settings :: file = /Users/lmcgibbn/Downloads/nutch/ivy/ivysettings.xml [ivy:resolve] [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] module not found: javax.measure#unit-api;working@LMC-056430 [ivy:resolve] local: tried [ivy:resolve] /Users/lmcgibbn/.ivy2/local/javax.measure/unit-api/working@LMC-056430/ivys/ivy.xml [ivy:resolve] -- artifact javax.measure#unit-api;working@LMC-056430!unit-api.jar: [ivy:resolve] /Users/lmcgibbn/.ivy2/local/javax.measure/unit-api/working@LMC-056430/jars/unit-api.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom [ivy:resolve] -- artifact javax.measure#unit-api;working@LMC-056430!unit-api.jar: [ivy:resolve] http://repo1.maven.org/maven2/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar [ivy:resolve] sonatype: tried [ivy:resolve] http://oss.sonatype.org/content/repositories/releases/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom [ivy:resolve] -- artifact javax.measure#unit-api;working@LMC-056430!unit-api.jar: [ivy:resolve] http://oss.sonatype.org/content/repositories/releases/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar [ivy:resolve] apache-snapshot: tried [ivy:resolve] https://repository.apache.org/content/repositories/snapshots/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom [ivy:resolve] -- 
artifact javax.measure#unit-api;working@LMC-056430!unit-api.jar: [ivy:resolve] https://repository.apache.org/content/repositories/snapshots/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar [ivy:resolve] restlet: tried [ivy:resolve] http://maven.restlet.org/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.pom [ivy:resolve] -- artifact javax.measure#unit-api;working@LMC-056430!unit-api.jar: [ivy:resolve] http://maven.restlet.org/javax/measure/unit-api/working@LMC-056430/unit-api-work...@lmc-056430.jar [ivy:resolve] ERRORS [ivy:resolve] impossible to get artifacts when data has not been loaded. IvyNode = javax.measure#unit-api;1.0 [ivy:resolve] [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS {code} was (Author: lewismc): [~wastl-nagel] has become a major pain whilst attempting to roll the 2.4 release candidate. The release is essentially blocked until this issue is resolved. > Reliable solution for javax.ws packaging.type > - > > Key: NUTCH-2669 > URL: https://issues.apache.org/jira/browse/NUTCH-2669 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 2.4, 1.16 >Reporter: Sebastian Nagel >Priority: Blocker > Fix For: 2.4 > > > The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an > ant/ivy issue during build when resolving/fetching dependencies: > {noformat} > [ivy:resolve] [FAILED ] > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}: (0ms) > [ivy:resolve] local: tried > [ivy:resolve] > /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type} > [ivy:resolve] maven2: tried > [ivy:resolve] > http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] apache-snapshot: tried > [ivy:resolve] > https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] 
sonatype: tried > [ivy:resolve] > http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] :: FAILED DOWNLOADS:: > [ivy:resolve] :: ^ see resolution messages for detai
[jira] [Updated] (NUTCH-2669) Reliable solution for javax.ws packaging.type
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2669: Priority: Blocker (was: Major) > Reliable solution for javax.ws packaging.type > - > > Key: NUTCH-2669 > URL: https://issues.apache.org/jira/browse/NUTCH-2669 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 2.4, 1.16 >Reporter: Sebastian Nagel >Priority: Blocker > Fix For: 2.5 > > > The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an > ant/ivy issue during build when resolving/fetching dependencies: > {noformat} > [ivy:resolve] [FAILED ] > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}: (0ms) > [ivy:resolve] local: tried > [ivy:resolve] > /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type} > [ivy:resolve] maven2: tried > [ivy:resolve] > http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] apache-snapshot: tried > [ivy:resolve] > https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] sonatype: tried > [ivy:resolve] > http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] :: FAILED DOWNLOADS:: > [ivy:resolve] :: ^ see resolution messages for details ^ :: > [ivy:resolve] :: > [ivy:resolve] :: > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] ERRORS > ... > BUILD FAILED > {noformat} > More information about this issue is linked on > [jax-rs#576|https://github.com/jax-rs/api/pull/576]. > A work-around is to define a property {{packaging.type}} and set it to > {{jar}}. 
This can be done > - in command-line {{ant -Dpackaging.type=jar ...}} > - in default.properties > - in ivysettings.xml > The last work-around is active in current master/1.x. However, there are > still Jenkins builds failing while few succeed: > ||#build||status jax-rs||machine||work-around|| > |3578|success|H28|ivysettings.xml| > |3577|failed|H28|ivysettings.xml| > |3576|failed|H33|ivysettings.xml| > |3575|success|ubuntu-4|ivysettings.xml| > |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties| > |3571|failed|?|-Dpackaging.type=jar + default.properties| > |3568|failed|?|-Dpackaging.type=jar + default.properties| > Builds which failed for other reasons are left away. The only pattern I see > is that only the second build on every of the Jenkins machines succeeds. A > possible reason could be that the build environments on the machines persist > state (the Nutch build directory, local ivy cache, etc.). If this is the > case, it may take some time until all Jenkins machines will succeed. > The ivysettings.xml work-around was the first which succeeded on a Jenkins > build but it may be the case that all three work-arounds apply. > The issue is supposed to be resolved (without work-arounds) by IVY-1577. > However, it looks like it isn't: > - get rc2 of ivy 2.5.0 (the URL may change): > {noformat} > % wget -O ivy/ivy-2.5.0-rc2-test.jar \ > > https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar > {noformat} > - edit default properties and set {{ivy.version=2.5.0-rc2-test}} > - remove work-around in ivysettings.xml (or default.properties) > - run {{ant clean runtime}} and check for failure resp. whether javax.ws lib > is in place: {{ls build/lib/javax.ws.rs-api*.jar}} > This solution fails for > [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar] > and the mentioned rc2 jar as of 2018-10-23. 
But maybe the procedure is > wrong, I'll contact the ant/ivy team to solve this. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
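The three work-arounds described in the issue all amount to setting the same property in different places. A sketch, with locations as named in the issue (the ivysettings.xml XML shape shown in the comment is illustrative):

```
# 1. On the command line:
ant -Dpackaging.type=jar clean runtime

# 2. In default.properties:
packaging.type=jar

# 3. In ivysettings.xml (illustrative element shape):
#    <property name="packaging.type" value="jar" override="false"/>
```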
[jira] [Updated] (NUTCH-2669) Reliable solution for javax.ws packaging.type
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2669: Fix Version/s: (was: 2.5) 2.4 > Reliable solution for javax.ws packaging.type > - > > Key: NUTCH-2669 > URL: https://issues.apache.org/jira/browse/NUTCH-2669 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 2.4, 1.16 >Reporter: Sebastian Nagel >Priority: Blocker > Fix For: 2.4 > > > The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an > ant/ivy issue during build when resolving/fetching dependencies: > {noformat} > [ivy:resolve] [FAILED ] > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}: (0ms) > [ivy:resolve] local: tried > [ivy:resolve] > /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type} > [ivy:resolve] maven2: tried > [ivy:resolve] > http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] apache-snapshot: tried > [ivy:resolve] > https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] sonatype: tried > [ivy:resolve] > http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] :: FAILED DOWNLOADS:: > [ivy:resolve] :: ^ see resolution messages for details ^ :: > [ivy:resolve] :: > [ivy:resolve] :: > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] ERRORS > ... > BUILD FAILED > {noformat} > More information about this issue is linked on > [jax-rs#576|https://github.com/jax-rs/api/pull/576]. > A work-around is to define a property {{packaging.type}} and set it to > {{jar}}. 
This can be done > - in command-line {{ant -Dpackaging.type=jar ...}} > - in default.properties > - in ivysettings.xml > The last work-around is active in current master/1.x. However, there are > still Jenkins builds failing while few succeed: > ||#build||status jax-rs||machine||work-around|| > |3578|success|H28|ivysettings.xml| > |3577|failed|H28|ivysettings.xml| > |3576|failed|H33|ivysettings.xml| > |3575|success|ubuntu-4|ivysettings.xml| > |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties| > |3571|failed|?|-Dpackaging.type=jar + default.properties| > |3568|failed|?|-Dpackaging.type=jar + default.properties| > Builds which failed for other reasons are left away. The only pattern I see > is that only the second build on every of the Jenkins machines succeeds. A > possible reason could be that the build environments on the machines persist > state (the Nutch build directory, local ivy cache, etc.). If this is the > case, it may take some time until all Jenkins machines will succeed. > The ivysettings.xml work-around was the first which succeeded on a Jenkins > build but it may be the case that all three work-arounds apply. > The issue is supposed to be resolved (without work-arounds) by IVY-1577. > However, it looks like it isn't: > - get rc2 of ivy 2.5.0 (the URL may change): > {noformat} > % wget -O ivy/ivy-2.5.0-rc2-test.jar \ > > https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar > {noformat} > - edit default properties and set {{ivy.version=2.5.0-rc2-test}} > - remove work-around in ivysettings.xml (or default.properties) > - run {{ant clean runtime}} and check for failure resp. whether javax.ws lib > is in place: {{ls build/lib/javax.ws.rs-api*.jar}} > This solution fails for > [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar] > and the mentioned rc2 jar as of 2018-10-23. 
But maybe the procedure is > wrong, I'll contact the ant/ivy team to solve this. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792267#comment-16792267 ] Lewis John McGibbney commented on NUTCH-2669: - [~wastl-nagel] this has become a major pain whilst attempting to roll the 2.4 release candidate. The release is essentially blocked until this issue is resolved. > Reliable solution for javax.ws packaging.type > - > > Key: NUTCH-2669 > URL: https://issues.apache.org/jira/browse/NUTCH-2669 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 2.4, 1.16 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 2.5 > > > The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an > ant/ivy issue during build when resolving/fetching dependencies: > {noformat} > [ivy:resolve] [FAILED ] > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}: (0ms) > [ivy:resolve] local: tried > [ivy:resolve] > /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type} > [ivy:resolve] maven2: tried > [ivy:resolve] > http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] apache-snapshot: tried > [ivy:resolve] > https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] sonatype: tried > [ivy:resolve] > http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] :: FAILED DOWNLOADS :: > [ivy:resolve] :: ^ see resolution messages for details ^ :: > [ivy:resolve] :: > [ivy:resolve] :: > javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] ERRORS > ... > BUILD FAILED > {noformat} > More information about this issue is linked on > [jax-rs#576|https://github.com/jax-rs/api/pull/576]. 
> A work-around is to define a property {{packaging.type}} and set it to > {{jar}}. This can be done > - in command-line {{ant -Dpackaging.type=jar ...}} > - in default.properties > - in ivysettings.xml > The last work-around is active in current master/1.x. However, there are > still Jenkins builds failing while a few succeed: > ||#build||status jax-rs||machine||work-around|| > |3578|success|H28|ivysettings.xml| > |3577|failed|H28|ivysettings.xml| > |3576|failed|H33|ivysettings.xml| > |3575|success|ubuntu-4|ivysettings.xml| > |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties| > |3571|failed|?|-Dpackaging.type=jar + default.properties| > |3568|failed|?|-Dpackaging.type=jar + default.properties| > Builds which failed for other reasons are left out. The only pattern I see > is that only the second build on each of the Jenkins machines succeeds. A > possible reason could be that the build environments on the machines persist > state (the Nutch build directory, local ivy cache, etc.). If this is the > case, it may take some time until builds on all Jenkins machines succeed. > The ivysettings.xml work-around was the first to succeed on a Jenkins > build, but it may be the case that all three work-arounds apply. > The issue is supposed to be resolved (without work-arounds) by IVY-1577. > However, it looks like it isn't: > - get rc2 of ivy 2.5.0 (the URL may change): > {noformat} > % wget -O ivy/ivy-2.5.0-rc2-test.jar \ > > https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar > {noformat} > - edit default properties and set {{ivy.version=2.5.0-rc2-test}} > - remove work-around in ivysettings.xml (or default.properties) > - run {{ant clean runtime}} and check whether the build fails or whether the javax.ws lib > is in place: {{ls build/lib/javax.ws.rs-api*.jar}} > This solution fails for > [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar] > and the mentioned rc2 jar as of 2018-10-23. But maybe the procedure is > wrong; I'll contact the Ant/Ivy team to solve this. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2498) Docker files are outdated
[ https://issues.apache.org/jira/browse/NUTCH-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788812#comment-16788812 ] Lewis John McGibbney commented on NUTCH-2498: - [~dhirajforyou] thank you for reporting this. I am just about to push a release for Nutch 2.4, but this is the last release for the 2.x release line. Do you want to provide a pull request? If not, then please just resolve this as won't fix. > Docker files are outdated > - > > Key: NUTCH-2498 > URL: https://issues.apache.org/jira/browse/NUTCH-2498 > Project: Nutch > Issue Type: Bug > Components: docker >Affects Versions: 2.4 >Reporter: dhirajforyou >Priority: Blocker > Labels: build > Fix For: 2.4 > > > Docker file for HBase is outdated. It uses Java 7, but Nutch requires Java 8. > Cassandra docker file refers to meabed/debian-jdk, which is also based on > Java 7. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Mavenize Nutch Build as Google Summer of Code
Hi user@ and dev@, If you are a student and would like to tackle the task of Mavenizing the Nutch master build, please get in touch with me here directly, or comment on the following issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-2292 Thank you Lewis -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[jira] [Resolved] (NUTCH-2698) Remove sonar build task from build.xml
[ https://issues.apache.org/jira/browse/NUTCH-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2698. - Resolution: Fixed Thanks [~wastl-nagel] for review. > Remove sonar build task from build.xml > -- > > Key: NUTCH-2698 > URL: https://issues.apache.org/jira/browse/NUTCH-2698 > Project: Nutch > Issue Type: Task > Components: build >Affects Versions: 1.15 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.16 > > > build.xml currently has the following content > {code} > > > > > > > > > > > > > > /> > > > > version="1.4-SNAPSHOT" xmlns:sonar="antlib:org.sonar.ant"/> > > {code} > We should simply remove as it is defunct. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2292: Labels: gsoc2019 (was: ) > Mavenize the build for nutch-core and nutch-plugins > --- > > Key: NUTCH-2292 > URL: https://issues.apache.org/jira/browse/NUTCH-2292 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Thamme Gowda >Assignee: Thamme Gowda >Priority: Major > Labels: gsoc2019 > Fix For: 1.16 > > > Convert the build system of nutch-core as well as plugins to Apache Maven. > *Plan :* > Create multi-module maven project with the following structure > {code}
> nutch-parent
> |-- pom.xml (POM)
> |-- nutch-core
> |   |-- pom.xml (JAR)
> |   |-- src: sources
> |-- nutch-plugins
>     |-- pom.xml (POM)
>     |-- plugin1
>     |   |-- pom.xml (JAR)
>     |   ...
>     |-- pluginN
>         |-- pom.xml (JAR)
> {code} > NOTE: watch out for cyclic dependencies between nutch-core and plugins, > introduce another POM to break the cycle if required. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
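The layout quoted above corresponds to a parent aggregator POM along the following lines. This is a minimal sketch only: the groupId and version are placeholders, not taken from the issue.

```xml
<!-- nutch-parent/pom.xml (sketch): aggregator POM for the proposed
     multi-module layout. groupId/version below are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.nutch</groupId>
  <artifactId>nutch-parent</artifactId>
  <version>1.16-SNAPSHOT</version>
  <packaging>pom</packaging>
  <modules>
    <module>nutch-core</module>
    <module>nutch-plugins</module>
  </modules>
</project>
```

nutch-plugins would itself be a POM-packaged aggregator listing each plugin as a module; the note about cyclic dependencies applies because plugins depend on nutch-core while the core build currently bundles the plugins.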
[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782482#comment-16782482 ] Lewis John McGibbney commented on NUTCH-2292: - Hi [~wastl-nagel] long story short... we need to rebase this against master and then talk through the issue. As it was a while ago, I've all but forgotten what decisions we made at the time and what the implications are/were. How about we propose this as a GSoC project? > Mavenize the build for nutch-core and nutch-plugins > --- > > Key: NUTCH-2292 > URL: https://issues.apache.org/jira/browse/NUTCH-2292 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Thamme Gowda >Assignee: Thamme Gowda >Priority: Major > Fix For: 1.16 > > > Convert the build system of nutch-core as well as plugins to Apache Maven. > *Plan :* > Create multi-module maven project with the following structure > {code}
> nutch-parent
> |-- pom.xml (POM)
> |-- nutch-core
> |   |-- pom.xml (JAR)
> |   |-- src: sources
> |-- nutch-plugins
>     |-- pom.xml (POM)
>     |-- plugin1
>     |   |-- pom.xml (JAR)
>     |   ...
>     |-- pluginN
>         |-- pom.xml (JAR)
> {code} > NOTE: watch out for cyclic dependencies between nutch-core and plugins, > introduce another POM to break the cycle if required. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2698) Remove sonar build task from build.xml
Lewis John McGibbney created NUTCH-2698: --- Summary: Remove sonar build task from build.xml Key: NUTCH-2698 URL: https://issues.apache.org/jira/browse/NUTCH-2698 Project: Nutch Issue Type: Task Components: build Affects Versions: 1.15 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.16 build.xml currently has the following content {code} {code} We should simply remove as it is defunct. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.
[ https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782477#comment-16782477 ] Lewis John McGibbney commented on NUTCH-2697: - Apologies folks, thank you [~wastl-nagel] for reverting. This has prompted me to look at another part of build.xml for a fix so I will go ahead and submit that. > Upgrade Ivy to fix the issue of an unset packaging.type property. > - > > Key: NUTCH-2697 > URL: https://issues.apache.org/jira/browse/NUTCH-2697 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.16 >Reporter: Chris Gavin >Priority: Major > Fix For: 1.16 > > > Currently Nutch fails to build from a clean checkout due to > {{packaging.type}} not being set (even with the current workaround in > {{ivysettings.xml}}). > {code:java} > [ivy:resolve] :: problems summary :: > [ivy:resolve] WARNINGS > [ivy:resolve] [FAILED ] > javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}: (0ms) > [ivy:resolve] local: tried > [ivy:resolve] > /opt/work/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1.1/${packaging.type}s/javax.ws.rs-api.${packaging.type} > [ivy:resolve] maven2: tried > [ivy:resolve] > http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type} > [ivy:resolve] apache-snapshot: tried > [ivy:resolve] > https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type} > [ivy:resolve] sonatype: tried > [ivy:resolve] > http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] :: FAILED DOWNLOADS :: > [ivy:resolve] :: ^ see resolution messages for details ^ :: > [ivy:resolve] :: > [ivy:resolve] :: > javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type} > [ivy:resolve] :: > [ivy:resolve] > BUILD FAILED{code} > This issue has been fixed in the latest version of 
Ivy so upgrading will > cause the build to work correctly again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2679) "ant eclipse" failed as eclipse binary is moved
[ https://issues.apache.org/jira/browse/NUTCH-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718828#comment-16718828 ] lewis john mcgibbney commented on NUTCH-2679: - Can we use https://search.maven.org/artifact/ant4eclipse/ant4eclipse/0.5.0.rc1/jar ? -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc > "ant eclipse" failed as eclipse binary is moved > --- > > Key: NUTCH-2679 > URL: https://issues.apache.org/jira/browse/NUTCH-2679 > Project: Nutch > Issue Type: Test > Components: build >Affects Versions: 1.15 >Reporter: dhirajforyou >Priority: Major > > > curl -I > "https://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" > HTTP/1.1 302 Found > Server: nginx/1.13.12 > Date: Wed, 12 Dec 2018 10:32:28 GMT > Content-Type: text/html; charset=UTF-8 > Connection: keep-alive > Content-Disposition: attachment; filename="ant-eclipse-1.0.bin.tar.bz2" > Set-Cookie: > sf_mirror_attempt="ant-eclipse:liquidtelecom:ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"; > Max-Age=120; Path=/; expires=Wed, 12-Dec-2018 10:34:28 GMT > Location: > [https://liquidtelecom.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2] > > So the eclipse binary source URL needs to be changed. > > @ [~wastl-nagel] @ [~lewismc] > Last time we changed http to https, and this time the URL got changed. > Can you suggest the best way to overcome this? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704793#comment-16704793 ] Lewis John McGibbney commented on NUTCH-2292: - I'm going to take this work on from where [~thammegowda] got to. I'll see if I can forward merge master into NUTCH-2292 branch first. > Mavenize the build for nutch-core and nutch-plugins > --- > > Key: NUTCH-2292 > URL: https://issues.apache.org/jira/browse/NUTCH-2292 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Thamme Gowda >Assignee: Thamme Gowda >Priority: Major > Fix For: 1.16 > > > Convert the build system of nutch-core as well as plugins to Apache Maven. > *Plan :* > Create multi-module maven project with the following structure > {code}
> nutch-parent
> |-- pom.xml (POM)
> |-- nutch-core
> |   |-- pom.xml (JAR)
> |   |-- src: sources
> |-- nutch-plugins
>     |-- pom.xml (POM)
>     |-- plugin1
>     |   |-- pom.xml (JAR)
>     |   ...
>     |-- pluginN
>         |-- pom.xml (JAR)
> {code} > NOTE: watch out for cyclic dependencies between nutch-core and plugins, > introduce another POM to break the cycle if required. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Maven vs Gradle for Nutch Build System
Hi Folks, Seb and I were talking build systems this week. I wanted to get a feel for what we as a PMC would rather use for the next Nutch build lifecycle. Personally I've used Maven for many of my Java projects; however, I have also really enjoyed working with Gradle. I would like to start working on the build system such that we can streamline the Nutch release process. Also, we've seen people request Nutch plugins as Maven artifacts for some time. Any thoughts? Lewis -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[jira] [Created] (NUTCH-2677) Update Jest client in indexer-elastic-rest plugin
Lewis John McGibbney created NUTCH-2677: --- Summary: Update Jest client in indexer-elastic-rest plugin Key: NUTCH-2677 URL: https://issues.apache.org/jira/browse/NUTCH-2677 Project: Nutch Issue Type: Task Components: indexer, plugin Affects Versions: 1.15 Reporter: Lewis John McGibbney Fix For: 1.16 We should really upgrade the dependency to a more recent version https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/ivy.xml We are using 2.0.1, the most recent is 6.3.1 https://search.maven.org/artifact/io.searchbox/jest/6.3.1/jar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
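The upgrade itself would amount to a one-line change to the dependency declaration in the plugin's ivy.xml. A sketch follows; the coordinates come from the Maven Central link above, but the conf mapping is an assumption modeled on typical Nutch plugin ivy files.

```xml
<!-- src/plugin/indexer-elastic-rest/ivy.xml (sketch): bump the Jest client
     from 2.0.1 to the cited 6.3.1. conf mapping is assumed. -->
<dependency org="io.searchbox" name="jest" rev="6.3.1" conf="*->default"/>
```

Note that a major-version jump like 2.x to 6.x would likely also require source changes in the plugin where the Jest API changed, not just the dependency bump.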
[jira] [Updated] (NUTCH-2667) Update Tika and Commons Collections 4
[ https://issues.apache.org/jira/browse/NUTCH-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2667: Description: Tika and Commons Collections 4 need to be updated. This issue needs to address them. (was: Tika and Commons Collections 4 need to be updated due to known CVE's. This issue needs to address them.) > Update Tika and Commons Collections 4 > - > > Key: NUTCH-2667 > URL: https://issues.apache.org/jira/browse/NUTCH-2667 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 2.4 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 2.4 > > > Tika and Commons Collections 4 need to be updated. This issue needs to > address them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2667) Update Tika and Commons Collections 4
Lewis John McGibbney created NUTCH-2667: --- Summary: Update Tika and Commons Collections 4 Key: NUTCH-2667 URL: https://issues.apache.org/jira/browse/NUTCH-2667 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 2.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.4 Tika and Commons Collections 4 need to be updated due to known CVEs. This issue needs to address them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2199) Documentation for Nutch 2.X REST API
[ https://issues.apache.org/jira/browse/NUTCH-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2199. - Resolution: Fixed > Documentation for Nutch 2.X REST API > > > Key: NUTCH-2199 > URL: https://issues.apache.org/jira/browse/NUTCH-2199 > Project: Nutch > Issue Type: New Feature > Components: documentation, REST_api >Affects Versions: 2.3.1 > Reporter: Lewis John McGibbney >Assignee: Furkan KAMACI >Priority: Minor > Fix For: 2.5 > > > The work done on NUTCH-1800 needs to be ported to 2.X branch. This is > trivial, I thought I had already done it but obviously not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol
[ https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594347#comment-16594347 ] Lewis John McGibbney commented on NUTCH-1861: - [~yossi] the existing JavaMail license is incompatible with ALv2.0, meaning that I am going to look at using the following instead {code}
<dependency>
  <groupId>org.apache.geronimo.javamail</groupId>
  <artifactId>geronimo-javamail_1.4</artifactId>
  <version>1.9.0-alpha-2</version>
  <type>pom</type>
</dependency>
{code} > Implement POP3 Protocol > --- > > Key: NUTCH-1861 > URL: https://issues.apache.org/jira/browse/NUTCH-1861 > Project: Nutch > Issue Type: Task > Components: protocol > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > > Implementing the Post Office Protocol within Nutch would open up a new use > case which is crawling and indexing of some mail servers. > This is particularly useful for investigation purposes or for porting/mapping > mail from one server to another. > We *may* be able to kill two birds with one stone by implementing both IMAP > and POP3 protocols under the one plugin. > http://commons.apache.org/proper/commons-net/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol
[ https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594141#comment-16594141 ] Lewis John McGibbney commented on NUTCH-1861: - Hi [~yossi] thanks for the response bq. Isn't SMTP only for sending (and relaying) messages? How can it be used for crawling? When I opened this ticket my understanding of SMTP was much more limited (it is still not great, however it is slightly better). Your description sounds correct. bq. I assume crawling in this instance will be in the context of a specific user (with password), but this user may have access to multiple mailboxes/folders (at least with IMAP, I don't think POP3 supports such features). Do you intend to support multiple users/passwords? Yes. These URIs would be injected as normal or read from configuration. bq. Why did you choose Commons Net over JavaMail? I was not sure which implementation was more suited to what we were looking to achieve. If JavaMail is the way to go, then I will code it up using that underlying library instead. > Implement POP3 Protocol > --- > > Key: NUTCH-1861 > URL: https://issues.apache.org/jira/browse/NUTCH-1861 > Project: Nutch > Issue Type: Task > Components: protocol > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > > Implementing the Post Office Protocol within Nutch would open up a new use > case which is crawling and indexing of some mail servers. > This is particularly useful for investigation purposes or for porting/mapping > mail from one server to another. > We *may* be able to kill two birds with one stone by implementing both IMAP > and POP3 protocols under the one plugin. > http://commons.apache.org/proper/commons-net/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol
[ https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593875#comment-16593875 ] Lewis John McGibbney commented on NUTCH-1861: - Hi Folks, using commons-net I was thinking of bundling support for SMTP(S), POP3(S) and IMAP(S) into the same _*protocol-email*_ plugin. Is this a better architecture choice than separating each implementation out into an individual protocol plugin? > Implement POP3 Protocol > --- > > Key: NUTCH-1861 > URL: https://issues.apache.org/jira/browse/NUTCH-1861 > Project: Nutch > Issue Type: Task > Components: protocol > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > > Implementing the Post Office Protocol within Nutch would open up a new use > case which is crawling and indexing of some mail servers. > This is particularly useful for investigation purposes or for porting/mapping > mail from one server to another. > We *may* be able to kill two birds with one stone by implementing both IMAP > and POP3 protocols under the one plugin. > http://commons.apache.org/proper/commons-net/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (NUTCH-1861) Implement POP3 Protocol
[ https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1861: --- Assignee: Lewis John McGibbney > Implement POP3 Protocol > --- > > Key: NUTCH-1861 > URL: https://issues.apache.org/jira/browse/NUTCH-1861 > Project: Nutch > Issue Type: Task > Components: protocol > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > > Implementing the Post Office Protocol within Nutch would open up a new use > case which is crawling and indexing of some mail servers. > This is particularly useful for investigation purposes or for porting/mapping > mail from one server to another. > We *may* be able to kill two birds with one stone by implementing both IMAP > and POP3 protocols under the one plugin. > http://commons.apache.org/proper/commons-net/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2633) Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13
[ https://issues.apache.org/jira/browse/NUTCH-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2633. - Resolution: Fixed We can address Ivy issues in a separate patch. > Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 > -- > > Key: NUTCH-2633 > URL: https://issues.apache.org/jira/browse/NUTCH-2633 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.16 > Environment: java version "10.0.2" 2018-07-17 > Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13) > Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode) > Nutch master 01c5d6ea17d7b60d25d4e65462b2a654f10680c3 (Thu Jul 26 14:55:38 > 2018 +0200) > Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.16 > > > I just got around to making a dev upgrade to >= JDK 10. > When building master using environment JDK > I get several compile time deprecations which are reflected in the attached > build log. > Additionally, I get some issues with Ivy... see below > {code} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.ivy.util.url.IvyAuthenticator > (file:/Users/lmcgibbn/.ant/lib/ivy-2.3.0.jar) to field > java.net.Authenticator.theAuthenticator > WARNING: Please consider reporting this to the maintainers of > org.apache.ivy.util.url.IvyAuthenticator > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > [ivy:resolve] :: problems summary :: > [ivy:resolve] ERRORS > [ivy:resolve] unknown resolver null > [ivy:resolve] unknown resolver null > [ivy:resolve] unknown resolver null > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2633) Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13
Lewis John McGibbney created NUTCH-2633: --- Summary: Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 Key: NUTCH-2633 URL: https://issues.apache.org/jira/browse/NUTCH-2633 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.16 Environment: java version "10.0.2" 2018-07-17 Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13) Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode) Nutch master 01c5d6ea17d7b60d25d4e65462b2a654f10680c3 (Thu Jul 26 14:55:38 2018 +0200) Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.16 I just got around to making a dev upgrade to >= JDK 10. When building master using environment JDK I get several compile time deprecations which are reflected in the attached build log. Additionally, I get some issues with Ivy... see below {code} WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.ivy.util.url.IvyAuthenticator (file:/Users/lmcgibbn/.ant/lib/ivy-2.3.0.jar) to field java.net.Authenticator.theAuthenticator WARNING: Please consider reporting this to the maintainers of org.apache.ivy.util.url.IvyAuthenticator WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release [ivy:resolve] :: problems summary :: [ivy:resolve] ERRORS [ivy:resolve] unknown resolver null [ivy:resolve] unknown resolver null [ivy:resolve] unknown resolver null {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2222. - Resolution: Fixed Thank you [~alaffet] and everyone else for attempting to fix. > re-fetch deletes all metadata except _csh_ and _rs_ > > > Key: NUTCH-2222 > URL: https://issues.apache.org/jira/browse/NUTCH-2222 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.3.1 > Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and > hbase-0.98.8-hadoop2 >Reporter: Adnane B. >Assignee: Furkan KAMACI >Priority: Major > Fix For: 2.4 > > Attachments: NUTCH-2222.patch, TestReFetch.java, index.html > > > This problem happens the second time I crawl a page > {code} > bin/nutch inject urls/ > bin/nutch generate -topN 1000 > bin/nutch fetch -all > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > second time (re-fetch): > {code} > bin/nutch generate -topN 1000 --> batchid changes for all existing pages > bin/nutch fetch -all --> *** metadata are deleted for all pages already > crawled *** > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > I reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2 > It happens only if the page has not changed > To reproduce easily, please add to nutch-site.xml : > {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564745#comment-16564745 ] Lewis John McGibbney commented on NUTCH-2222: - [~alaffet] thank you, can you please provide a patch? > re-fetch deletes all metadata except _csh_ and _rs_ > > > Key: NUTCH-2222 > URL: https://issues.apache.org/jira/browse/NUTCH-2222 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.3.1 > Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and > hbase-0.98.8-hadoop2 >Reporter: Adnane B. >Assignee: Furkan KAMACI >Priority: Major > Fix For: 2.4 > > Attachments: TestReFetch.java, index.html > > > This problem happens the second time I crawl a page > {code} > bin/nutch inject urls/ > bin/nutch generate -topN 1000 > bin/nutch fetch -all > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > second time (re-fetch): > {code} > bin/nutch generate -topN 1000 --> batchid changes for all existing pages > bin/nutch fetch -all --> *** metadata are deleted for all pages already > crawled *** > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > I reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2 > It happens only if the page has not changed > To reproduce easily, please add to nutch-site.xml : > {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484563#comment-16484563 ] Lewis John McGibbney commented on NUTCH-2512: - See my comment above... > Nutch 1.14 does not work under JDK9 > --- > > Key: NUTCH-2512 > URL: https://issues.apache.org/jira/browse/NUTCH-2512 > Project: Nutch > Issue Type: Bug > Components: build, injector >Affects Versions: 1.14 > Environment: Ubuntu 16.04 (All patches up to 02/20/2018) > Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018) >Reporter: Ralf >Priority: Major > Fix For: 1.15 > > > Nutch 1.14 (Source) does not compile properly under JDK 9 > Nutch 1.14 (Binary) does not function under Java 9 > > When trying to build Nutch, Ant complains about missing Sonar files then > exits with: > "BUILD FAILED > /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" " > > Once having commented out the "offending code" the build finishes, but the > resulting binary fails to function (as well as the Apache-compiled binary > distribution); both exit with: > > Injecting seed URLs > /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/ > Injector: starting at 2018-02-21 02:02:16 > Injector: crawlDb: searchcrawl/crawldb > Injector: urlDir: urls > Injector: Converting injected urls to crawl db entries. 
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2539: Fix Version/s: 1.15 > Not correct naming of db.url.filters and db.url.normalizers in > nutch-default.xml > > > Key: NUTCH-2539 > URL: https://issues.apache.org/jira/browse/NUTCH-2539 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Major > Fix For: 1.15 > > > There is a mismatch between config and code. > In code, > In CrawlDbFilter line 41:43 > > public static final String URL_FILTERING = "crawldb.url.filters"; > > public static final String URL_NORMALIZING = "crawldb.url.normalizers"; > > public static final String URL_NORMALIZING_SCOPE = > > "crawldb.url.normalizers.scope"; > > In nutch-default.xml > > > > db.url.normalizers > > false > > Normalize urls when updating crawldb > > > > > > > > db.url.filters > > false > > Filter urls when updating crawldb > > > These properties should be in line with code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
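The mismatch described above can be illustrated with a minimal sketch. This is not Nutch code: a plain `Map` stands in for Hadoop's `Configuration`, the helper names are hypothetical, and the default of `true` is illustrative only. The point is that a property set under the old `db.url.filters` name never reaches code that reads `crawldb.url.filters`, so the configured value is silently ignored:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of why the naming mismatch matters: the code reads
// "crawldb.url.filters" while nutch-default.xml declared "db.url.filters",
// so the value from the config file is silently ignored and the code's
// default wins. Names and the default value are illustrative.
public class ConfigMismatchDemo {
    static final String URL_FILTERING = "crawldb.url.filters"; // key the code reads

    static boolean isFilteringEnabled(Map<String, String> conf) {
        String v = conf.get(URL_FILTERING);      // getBoolean-style lookup
        return v == null ? true : Boolean.parseBoolean(v);
    }

    // Setting the old property name: the value never reaches the code.
    static boolean withWrongKey() {
        Map<String, String> conf = new HashMap<>();
        conf.put("db.url.filters", "false");
        return isFilteringEnabled(conf);         // still true: key ignored
    }

    // Setting the key the code actually reads: the value takes effect.
    static boolean withMatchingKey() {
        Map<String, String> conf = new HashMap<>();
        conf.put("crawldb.url.filters", "false");
        return isFilteringEnabled(conf);
    }
}
```

Aligning the property names in nutch-default.xml with the constants in CrawlDbFilter removes this silent-ignore failure mode.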
[jira] [Resolved] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2539. - Resolution: Fixed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2550) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2550. - Resolution: Fixed > Fetcher fails to follow redirects > - > > Key: NUTCH-2550 > URL: https://issues.apache.org/jira/browse/NUTCH-2550 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.15 >Reporter: Hans Brende >Priority: Blocker > Fix For: 1.15 > > > As I detailed in this github > [comment|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#r28470348], > it appears that PR #221 broke redirects. The fetcher will repeatedly fetch > the *original url* rather than the one it's supposed to be redirecting to > until {{http.redirect.max}} is exceeded, and then end with > {{STATUS_FETCH_GONE}}. > I noticed this issue when I was trying to crawl a site with a 301 MOVED > PERMANENTLY status code. > Should be pretty easy to fix though: I was able to get redirects working > again simply by inserting the code {code:java}url = fit.url{code} > [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L388] > and > [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L409]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
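The failure mode in the report above can be sketched outside Nutch. This is an illustrative stand-in, not the FetcherThread API: if the working `url` variable is never advanced to the redirect target (the reported fix is `url = fit.url`), each iteration re-requests the original URL until the redirect limit is exceeded and the fetch ends as gone:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the redirect bug (names are illustrative, not
// Nutch's API): without updating `url` to the redirect target, the
// loop refetches the original URL until http.redirect.max is exceeded,
// analogous to ending with STATUS_FETCH_GONE.
public class RedirectDemo {
    // redirects maps a URL to its 301 target; absent keys mean 200 OK.
    static String fetch(Map<String, String> redirects, String url,
                        int maxRedirects, boolean followTarget) {
        int redirectCount = 0;
        while (redirects.containsKey(url)) {
            if (++redirectCount > maxRedirects) {
                return "GONE";            // gave up after too many redirects
            }
            if (followTarget) {
                url = redirects.get(url); // the fix: move on to the target
            }
            // without the fix, `url` is unchanged and gets refetched
        }
        return "FETCHED " + url;
    }

    static String demo(boolean followTarget) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://old.example/", "http://new.example/");
        return fetch(redirects, "http://old.example/", 5, followTarget);
    }
}
```

With the one-line fix the loop converges on the target after a single hop; without it, the same URL is fetched until the cap is hit.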
[jira] [Closed] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2545. --- > Upgrade to Any23 2.2 > > > Key: NUTCH-2545 > URL: https://issues.apache.org/jira/browse/NUTCH-2545 > Project: Nutch > Issue Type: Improvement > Components: any23, plugin > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.15 > > > We recently released Any23 2.2. I would like to update the Any23 plugin to > this newest version. > PR coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2545. - Resolution: Fixed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2536) GeneratorReducer.count is a static variable
[ https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2536. - Resolution: Fixed > GeneratorReducer.count is a static variable > --- > > Key: NUTCH-2536 > URL: https://issues.apache.org/jira/browse/NUTCH-2536 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.3.1 > Environment: Non-distributed, single node, standalone Nutch jobs run > in a single JVM with HBase as the data store. 2.3.1 >Reporter: Ben Vachon >Priority: Minor > Labels: Generate > Fix For: 2.4 > > Original Estimate: 2.4h > Remaining Estimate: 2.4h > > The count field of the GeneratorReducer class is a static field. This means > that if the GeneratorJob is run multiple times within the same JVM, it will > count all the webpages generated across all batches. > The count field is checked against the GeneratorJob's topN configuration > variable, which is described as: > "top threshold for maximum number of URLs permitted in a batch" > I understand this to mean that EACH batch should be capped at the topN value, > not ALL batches. > This isn't a problem with the way that Nutch is typically used, because the > script starts a new JVM each time. I'm not using the script; I'm calling the > Java classes directly (using the ToolRunner) within an existing JVM, so I'm > categorizing this as an SDK issue. > Changing the field to be non-static will not affect the behavior of the class > as it's run by the script. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
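The static-field problem described in this issue can be reproduced with a minimal sketch. Class and method names here are illustrative stand-ins for GeneratorReducer and its `count` field, not Nutch code:

```java
// Minimal sketch of the bug: a static counter survives across "batches"
// run in the same JVM, so a per-batch topN cap silently becomes a
// cross-batch cap. The instance field shows the non-static fix.
public class StaticCounterDemo {
    static long staticCount = 0;   // shared by all instances (the bug)
    long instanceCount = 0;        // fresh per reducer instance (the fix)

    long runBatchStatic(int urls, long topN) {
        long emitted = 0;
        for (int i = 0; i < urls && staticCount < topN; i++) {
            staticCount++;
            emitted++;
        }
        return emitted;            // entries actually generated this batch
    }

    long runBatchInstance(int urls, long topN) {
        long emitted = 0;
        for (int i = 0; i < urls && instanceCount < topN; i++) {
            instanceCount++;
            emitted++;
        }
        return emitted;
    }

    // Two batches of 5 URLs with topN = 5 in the same JVM.
    static long secondBatchStatic() {
        new StaticCounterDemo().runBatchStatic(5, 5);        // first batch fills the cap
        return new StaticCounterDemo().runBatchStatic(5, 5); // second batch starves
    }

    static long secondBatchInstance() {
        new StaticCounterDemo().runBatchInstance(5, 5);
        return new StaticCounterDemo().runBatchInstance(5, 5); // full batch again
    }
}
```

With the static field the second batch emits nothing because the counter already reached topN; with an instance field each batch gets its own count, matching the "per batch" reading of topN.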
[jira] [Created] (NUTCH-2545) Upgrade to Any23 2.2
Lewis John McGibbney created NUTCH-2545: --- Summary: Upgrade to Any23 2.2 Key: NUTCH-2545 URL: https://issues.apache.org/jira/browse/NUTCH-2545 Project: Nutch Issue Type: Improvement Components: any23, plugin Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.15 We recently released Any23 2.2. I would like to update the Any23 plugin to this newest version. PR coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work stopped] (NUTCH-2516) Hadoop imports use wildcards
[ https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2516 stopped by Lewis John McGibbney. --- > Hadoop imports use wildcards > > > Key: NUTCH-2516 > URL: https://issues.apache.org/jira/browse/NUTCH-2516 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.14 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > Right now the Hadoop imports use wildcards all over the place. > We wanted to address this during NUTCH-2375 but didn't get around to it. > We should address it in a new issue as it is still important. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()
[ https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415429#comment-16415429 ] Lewis John McGibbney commented on NUTCH-2518: - Yes please do [~omkar20895] Thank you > Must check return value of job.waitForCompletion() > -- > > Key: NUTCH-2518 > URL: https://issues.apache.org/jira/browse/NUTCH-2518 > Project: Nutch > Issue Type: Bug > Components: crawldb, fetcher, generator, hostdb, linkdb >Affects Versions: 1.15 >Reporter: Sebastian Nagel >Assignee: Kenneth McFarland >Priority: Blocker > Fix For: 1.15 > > > The return value of job.waitForCompletion() of the new MapReduce API > (NUTCH-2375) must always be checked. If it's not true, the job has been > failed or killed. Accordingly, the program > - should not proceed with further jobs/steps > - must clean-up temporary data, unlock CrawlDB, etc. > - exit with non-zero exit value, so that scripts running the crawl workflow > can handle the failure > Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR > #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
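The pattern this issue asks for can be sketched with a stand-in for `org.apache.hadoop.mapreduce.Job` (only the boolean-returning `waitForCompletion` matters here; the interface and helper below are hypothetical, not Nutch's actual structure): check the result, run cleanup on failure, and surface a non-zero exit code so crawl scripts can react.

```java
// Sketch of checking job.waitForCompletion(): on failure (or exception),
// clean up temporary state and report a non-zero exit value instead of
// proceeding to the next job in the workflow.
public class JobResultDemo {
    interface Job {
        boolean waitForCompletion(boolean verbose) throws Exception;
    }

    // Returns the exit code a tool should hand back to its runner:
    // 0 on success, non-zero when the job failed or was killed.
    static int runAndCheck(Job job, Runnable cleanup) {
        try {
            if (!job.waitForCompletion(true)) {
                cleanup.run();   // e.g. delete temp dirs, unlock the CrawlDb
                return 1;        // lets calling scripts detect the failure
            }
            return 0;
        } catch (Exception e) {
            cleanup.run();       // same handling when the job throws
            return 1;
        }
    }
}
```

The key point from the issue: a `false` return must stop the workflow, trigger cleanup, and propagate as a non-zero exit value rather than being silently ignored.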
[jira] [Work started] (NUTCH-2516) Hadoop imports use wildcards
[ https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2516 started by Lewis John McGibbney. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398025#comment-16398025 ] Lewis John McGibbney commented on NUTCH-2517: - Correct [~wastl-nagel] > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus > Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 
20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) 
> at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2427) Remove all the Hadoop wildcard imports.
[ https://issues.apache.org/jira/browse/NUTCH-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2427. - Resolution: Duplicate > Remove all the Hadoop wildcard imports. > --- > > Key: NUTCH-2427 > URL: https://issues.apache.org/jira/browse/NUTCH-2427 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Omkar Reddy >Priority: Minor > Labels: easyfix > > This improvement deals with removing the wildcard imports like "import > org.apache.hadoop.package.* " -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Upgrade to Hadoop 3
Hi Seb,

On 2018/03/12 11:00:52, Sebastian Nagel wrote:
> Hi,
>
> > seeing as we have just merged in the 'new' MR patch
>
> yep, but there's still something to do (NUTCH-2517,

ACK, this needs more testing.

> NUTCH-2518).

I honestly didn't see this come through, but yes, you are right.

> Better to address this before any upgrade of the Hadoop version.

ACK

> But since there seem to be no breaking MapReduce API changes
> http://hadoop.apache.org/docs/r3.0.0/index.html
> I would even expect that the Nutch job jar (built for 2.7)
> will run on Hadoop 3.0, or does it not?

I have absolutely no idea. I've certainly not had an opportunity to run on a Hadoop 3 cluster.
[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()
[ https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397808#comment-16397808 ] Lewis John McGibbney commented on NUTCH-2518: - Hi [~wastl-nagel] I think we just overwrote this (as opposed to the commit being lost). I can submit a PR to bring the previous functionality back; however, there is some additional work to be done to address the three bullet points you've highlighted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Upgrade to Hadoop 3
Hi RRK,

Response inline.

On 2018/03/08 01:46:18, BlackIce wrote:
> Why do you say "Is it too early"? Could you please elaborate on this, thnx.

What I mean is that maybe a lot of people have not yet upgraded existing infrastructure to Hadoop 3. People don't usually move large installations for some time... that was all :)

Lewis
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391440#comment-16391440 ] Lewis John McGibbney commented on NUTCH-2517: - Can anyone else confirm the above? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391430#comment-16391430 ] Lewis John McGibbney commented on NUTCH-2517: - Hi [~mebbinghaus] I ran it from the Docker container and can reproduce some of your results; there is one nuance, however, which I'll explain below. When I run mergesegs and inspect the data structures created within mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. So there must be something wrong with your crawl cycle for you to have generated only one directory; I'll leave that to you to confirm. The other issue, however, is that when I attempt to invertlinks using one of the merged segments, I end up with the same stack trace as you, so I am looking into the code right now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Upgrade to Hadoop 3
Hi Folks,

Before we get started with GSoC again, and seeing as we have just merged in the 'new' MR patch, I wonder if folks are partial to migrating to Hadoop 3? Is it too early? Comments?

Lewis
--
http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388718#comment-16388718 ] Lewis John McGibbney commented on NUTCH-2517: - Should be noted that I didn't run this from the Docker container. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2517:
---
Assignee: Lewis John McGibbney

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
> Reporter: Marco Ebbinghaus
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
> The problem probably occurs since commit
> https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create container from apache/nutch image (latest)
> * open terminal in that container
> * set http.agent.name
> * create crawldir and urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
> * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
> * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
> ** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
> ** resulting segment: 20180304134535
> * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
> * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments) which results in a consequential error
> ** console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
> LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems MapReduce corrupts the segment folder during the mergesegs command.
>
> Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb cycles before running mergesegs, resulting in a segment count > 1.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
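The reproduction steps above can be sketched as a single shell sequence. Paths come from the report; the `NUTCH` launcher variable and the newest-segment selection via `ls | tail` are illustrative additions (the report hard-codes the generated timestamp instead).

```shell
#!/bin/sh
# Sketch of the reproduction steps above. NUTCH defaults to the real
# launcher; set NUTCH=echo to dry-run and just print the command sequence.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL=mycrawl

reproduce() {
  $NUTCH inject "$CRAWL/crawldb" urls/urls
  $NUTCH generate "$CRAWL/crawldb" "$CRAWL/segments" 1
  # generate creates a timestamped segment directory; pick the newest one
  seg=$(ls -d "$CRAWL"/segments/* 2>/dev/null | tail -n 1)
  $NUTCH fetch "$seg" -threads 2
  $NUTCH parse "$seg" -threads 2
  $NUTCH updatedb "$CRAWL/crawldb" "$seg"
  $NUTCH mergesegs "$CRAWL/MERGEDsegments" "$CRAWL"/segments/* -filter
  # per the report, the merged segment now contains only crawl_generate,
  # so the next step fails with InvalidInputException on parse_data
  $NUTCH invertlinks "$CRAWL/linkdb" -dir "$CRAWL/MERGEDsegments"
}
```

With `NUTCH=echo reproduce` prints the command sequence without needing a Nutch installation, which is handy for checking the step order before running it inside the container.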
[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388650#comment-16388650 ]

Lewis John McGibbney edited comment on NUTCH-2517 at 3/6/18 11:09 PM:
--
I cannot reproduce this... see below for tests

{code}
//inject
/usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb urls/seed.txt
Injector: starting at 2018-03-06 14:31:10
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01
{code}

{code}
//simple 'ls' to see what we have
/usr/local/nutch(master) $ ls mycrawl/crawldb/
current/  old/
{code}

{code}
// generate
/usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb mycrawl/segments 1
Generator: starting at 2018-03-06 14:31:37
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mycrawl/segments/20180306143139
Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03
{code}

{code}
//fetch
/usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch mycrawl/segments/20180306143139 -threads 2
Fetcher: starting at 2018-03-06 14:32:15
Fetcher: segment: mycrawl/segments/20180306143139
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit :0
FetcherThread 36 Using queue mode : byHost
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02
{code}

{code}
//parse
/usr/local/nutch(master) $ ./runtime/local/bin/nutch parse mycrawl/segments/20180306143139 -threads 2
ParseSegment: starting at 2018-03-06 14:32:45
ParseSegment: segment: mycrawl/segments/20180306143139
Parsed (140ms):http://nutch.apache.org:-1/
ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01
{code}

{code}
// lets see what we have
/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
/usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/
content/  crawl_fetch/  crawl_generate/  crawl_parse/  parse_data/  parse_text/
{code}

{code}
//updatedb
/usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180306143139/
CrawlDb update: starting at 2018-03-06 14:33:40
CrawlDb update: db: mycrawl/crawldb
CrawlDb update: segments: [mycrawl/segments/20180306143139]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01
{code}

{code}
//lets see what we have
/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
{code}

{code}
//mergesegs with -dir option
/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter
Merging 1 segments to mycrawl/MERGEDsegments/20180306143518
SegmentMerger: adding file:/usr/local/nutch/mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
{code}

{code}
// lets see what we have
/usr/local/nutch(master) $ ls mycrawl/
MERGEDsegments/  crawldb/  segments/
/usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/
crawl_generate/  crawl_parse/
{code}

{code}
//mergesegs with single segment directory without dir option
/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter
Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617
SegmentMerger: adding mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
{code}

{code}
// mergesegs with array of segment directories
lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch
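The transcript above repeatedly logs `SegmentMerger: using segment data from: ...`, which reads as the merger advertising whichever parts it finds across its input segments. As a hedged illustration only (an assumption about the observable behaviour, not SegmentMerger's actual code), that union-of-existing-subdirectories probe can be modelled in a few lines of shell:

```shell
#!/bin/sh
# Simplified model (assumption, not Nutch source): report a segment part
# -- content, crawl_fetch, etc. -- if at least one input segment has it.
parts_in_use() {
  for part in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
    for seg in "$@"; do
      if [ -d "$seg/$part" ]; then
        printf '%s ' "$part"
        break
      fi
    done
  done
  echo
}

# Example usage (illustrative): parts_in_use mycrawl/segments/*
```

Comparing this expected union against the actual contents of the merged output directory is exactly the discrepancy the bug report describes: all six parts are announced, but only `crawl_generate` survives the merge.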
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386469#comment-16386469 ]

Lewis John McGibbney commented on NUTCH-2517:
-
Thank you [~mebbinghaus] for reporting. This appears to be a major bug and hence a blocker for the next release. I will begin work on a solution ASAP. FYI [~omkar20895] this is post Hadoop upgrade.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2517:
Priority: Blocker  (was: Major)
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2517: Fix Version/s: 1.15 > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Priority: Major > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 
20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) 
> at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So it seems MapReduce corrupts the segment folder during the mergesegs command. > > Note that this issue is not limited to merging a single segment as described above. > As the attached screenshot shows, the problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before running mergesegs, resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
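The failure mode above (a merged segment containing only crawl_generate, discovered later by invertlinks) can be detected right after merging. The following is a small standalone diagnostic sketch, not part of Nutch itself; the class name and the check are my own, only the list of expected subdirectory names comes from the report above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not part of Nutch): reports which of the
// standard segment subdirectories are missing from a segment directory,
// so an incomplete merge is caught before invertlinks fails on it.
public class SegmentChecker {

    static final String[] EXPECTED = {
        "content", "crawl_generate", "crawl_fetch",
        "crawl_parse", "parse_data", "parse_text"
    };

    // Returns the names of expected subdirectories absent from the segment dir.
    public static List<String> missingParts(File segmentDir) {
        List<String> missing = new ArrayList<>();
        for (String part : EXPECTED) {
            if (!new File(segmentDir, part).isDirectory()) {
                missing.add(part);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        File seg = new File(args.length > 0 ? args[0] : ".");
        List<String> missing = missingParts(seg);
        if (missing.isEmpty()) {
            System.out.println("Segment complete: " + seg);
        } else {
            System.out.println("Segment " + seg + " is missing: " + missing);
        }
    }
}
```

Run against mycrawl/MERGEDsegments/20180304134535 from the report, this would list the five subdirectories lost in the merge.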
Re: Apache Nutch - Exception since last commit
Hello Marco, Thank you very much for the information. Please register the issue on jira. I will personally look into it and make best efforts to fix the bug if one exists. Please provide details as to how I can reproduce. Thanks, Lewis On Sat, Mar 3, 2018 at 09:14 Marco Ebbinghaus wrote: > Hello Lewis, > > I just wanted to let you know that I am experiencing problems with > nutch since your last merge 4 days ago. I am using the latest-tagged > docker image version of apache/nutch. > > On my live system (which is some weeks old), everything works fine. But > since the last local image re-pull today I cannot get nutch working. I am > using a script which runs inject, generate, fetch, parse, updatedb, > mergesegs, invertlinks, index cycles. > > Everything works fine until the merging / invertlinks. I have three > segment folders with all required subfolders. But it seems like the > merging of the three segments isn't done correctly, so the merged > segment folder is incomplete. I can absolutely reproduce this. > > I haven't investigated the problem further and I have no more time > today. But I wanted to inform you already. Maybe you have an idea. Maybe > I will have some time to investigate the problem further tomorrow. > > Here are the commands that are executed: > > > $NUTCH_HOME/bin/nutch mergesegs > $NUTCH_HOME/$crawlDirName/MERGEDsegments > $NUTCH_HOME/$crawlDirName/segments/* -filter > > rm $RMARGS $NUTCH_HOME/$crawlDirName/segments > > mv $MVARGS $NUTCH_HOME/$crawlDirName/MERGEDsegments > $NUTCH_HOME/$crawlDirName/segments > > $NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/$crawlDirName/linkdb -dir > $NUTCH_HOME/$crawlDirName/segments > > > and I will attach a screenshot with the stacktrace. > > > Greetings, > > > Marco Ebbinghaus > > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[jira] [Updated] (NUTCH-2516) Hadoop imports use wildcards
[ https://issues.apache.org/jira/browse/NUTCH-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2516: Description: Right now the Hadoop imports use wildcards all over the place. We wanted to address this during NUTCH-2375 but didn't get around to it. We should address it in a new issue as it is still important. was: Right now the Hadoop imports use wildcards all over the place. We wanted to address this during NUTCH-2371 but didn't get around to it. We should address it in a new issue as it is still important. > Hadoop imports use wildcards > > > Key: NUTCH-2516 > URL: https://issues.apache.org/jira/browse/NUTCH-2516 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.14 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > Right now the Hadoop imports use wildcards all over the place. > We wanted to address this during NUTCH-2375 but didn't get around to it. > We should address it in a new issue as it is still important. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2516) Hadoop imports use wildcards
Lewis John McGibbney created NUTCH-2516: --- Summary: Hadoop imports use wildcards Key: NUTCH-2516 URL: https://issues.apache.org/jira/browse/NUTCH-2516 Project: Nutch Issue Type: Improvement Affects Versions: 1.14 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.15 Right now the Hadoop imports use wildcards all over the place. We wanted to address this during NUTCH-2371 but didn't get around to it. We should address it in a new issue as it is still important. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
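The hazard behind NUTCH-2516 is that wildcard imports make a simple class name ambiguous when two imported packages declare it, which is exactly the situation with org.apache.hadoop.mapred and org.apache.hadoop.mapreduce (both declare classes such as FileInputFormat). A minimal, self-contained illustration using only the JDK, where java.util and java.awt both declare a class named List:

```java
// With both `import java.util.*;` and `import java.awt.*;`, the bare name
// "List" would not compile: both packages declare a class called List.
// Explicit single-type imports resolve the ambiguity and make the choice
// visible -- the same reasoning applies to Hadoop's mapred vs. mapreduce
// packages, which share many class names.
import java.util.ArrayList;
import java.util.List;   // explicitly java.util.List, not java.awt.List

public class ExplicitImports {
    public static List<String> segmentParts() {
        List<String> names = new ArrayList<>();
        names.add("crawl_generate");
        names.add("parse_data");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(segmentParts());
    }
}
```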
[jira] [Assigned] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2375: --- Assignee: Lewis John McGibbney > Upgrade the code base from org.apache.hadoop.mapred to > org.apache.hadoop.mapreduce > -- > > Key: NUTCH-2375 > URL: https://issues.apache.org/jira/browse/NUTCH-2375 > Project: Nutch > Issue Type: Improvement > Components: deployment >Affects Versions: 1.13 >Reporter: Omkar Reddy > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > Nutch is still using the org.apache.hadoop.mapred dependency, which > has been deprecated. It needs to be updated to the org.apache.hadoop.mapreduce > dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373173#comment-16373173 ] Lewis John McGibbney commented on NUTCH-2512: - Hi [~Bl4ck1c3] thanks for logging the issue... this behavior was to be 'expected'. As you can see from the [javac.version|https://github.com/apache/nutch/blob/release-1.14/default.properties#L60] we had it pinned at 1.8. I suppose we can make the upgrade for 1.15... (queue patch ;)) > Nutch 1.14 does not work under JDK9 > --- > > Key: NUTCH-2512 > URL: https://issues.apache.org/jira/browse/NUTCH-2512 > Project: Nutch > Issue Type: Bug > Components: build, injector >Affects Versions: 1.14 > Environment: Ubuntu 16.04 (All patches up to 02/20/2018) > Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018) >Reporter: Ralf >Priority: Major > Fix For: 1.15 > > > Nutch 1.14 (Source) does not compile properly under JDK 9 > Nutch 1.14 (Binary) does not function under Java 9 > > When trying to build Nutch, Ant complains about missing Sonar files, then > exits with: > "BUILD FAILED > /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" " > > Once the "offending code" is commented out, the build finishes, but the > resulting binary fails to function (as does the Apache-compiled binary > distribution); both exit with: > > Injecting seed URLs > /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/ > Injector: starting at 2018-02-21 02:02:16 > Injector: crawlDb: searchcrawl/crawldb > Injector: urlDir: urls > Injector: Converting injected urls to crawl db entries. 
> WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.security.authentication.util.KerberosUtil > (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method > sun.security.krb5.Config.getInstance() > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.security.authentication.util.KerberosUtil > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > Injector: java.lang.NullPointerException > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413) > at > org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.base/java.security.AccessController.doPrivileged(Native > Method) > at java.base/javax.security.auth.Subject.doAs(Subject.java:423) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.Injector.inject(Injector.java:417) > at org.apache.nutch.crawl.Injector.run(Injector.java:563) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.Injector.main(Injector.java:528) > > Error running: > /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/ > 
Failed with exit value 255. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
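The comment above points at the pinned compiler level in default.properties of the 1.14 release. A sketch of the relevant line (the single property value is taken from the comment; any surrounding lines in the actual file may differ):

```properties
# default.properties (Nutch release-1.14) -- compiler level pinned here,
# which is why building and running under JDK 9 was not expected to work
javac.version=1.8
```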
[jira] [Updated] (NUTCH-2512) Nutch 1.14 does not work under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2512: Fix Version/s: 1.15 > Nutch 1.14 does not work under JDK9 > --- > > Key: NUTCH-2512 > URL: https://issues.apache.org/jira/browse/NUTCH-2512 > Project: Nutch > Issue Type: Bug > Components: build, injector >Affects Versions: 1.14 > Environment: Ubuntu 16.04 (All patches up to 02/20/2018) > Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018) >Reporter: Ralf >Priority: Major > Fix For: 1.15 > > > Nutch 1.14 (Source) does not compile properly under JDK 9 > Nutch 1.14 (Binary) does not function under Java 9 > > When trying to build Nutch, Ant complains about missing Sonar files, then > exits with: > "BUILD FAILED > /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" " > > Once the "offending code" is commented out, the build finishes, but the > resulting binary fails to function (as does the Apache-compiled binary > distribution); both exit with: > > Injecting seed URLs > /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/ > Injector: starting at 2018-02-21 02:02:16 > Injector: crawlDb: searchcrawl/crawldb > Injector: urlDir: urls > Injector: Converting injected urls to crawl db entries. 
> WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.security.authentication.util.KerberosUtil > (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method > sun.security.krb5.Config.getInstance() > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.security.authentication.util.KerberosUtil > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > Injector: java.lang.NullPointerException > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413) > at > org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.base/java.security.AccessController.doPrivileged(Native > Method) > at java.base/javax.security.auth.Subject.doAs(Subject.java:423) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.Injector.inject(Injector.java:417) > at org.apache.nutch.crawl.Injector.run(Injector.java:563) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.Injector.main(Injector.java:528) > > Error running: > /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/ > 
Failed with exit value 255. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2489. - Resolution: Fixed Thank you [~yossi] > Dependency collision with lucene-analyzers-common in scoring-similarity plugin > -- > > Key: NUTCH-2489 > URL: https://issues.apache.org/jira/browse/NUTCH-2489 > Project: Nutch > Issue Type: Bug > Components: scoring >Affects Versions: 1.14 >Reporter: Yossi Tamari >Priority: Major > Fix For: 1.15 > > Attachments: ivy.xml.patch > > > After updating to Master branch of 1.14, we get a few compile errors in > LuceneTokenizer.java and LuceneAnalyzerUtil.java: > {code:java} > Type mismatch: cannot convert from org.apache.lucene.analysis.CharArraySet to > org.apache.lucene.analysis.util.CharArraySet > {code} > This seems to be caused by the fact that scoring-similarity compiles with > lucene-analyzers-common-5.5.0.jar (from ivy.xml), but with lucene-core-6.4.1 > instead of the matching 5.5.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
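The description above attributes the compile errors to mismatched Lucene artifact versions (lucene-analyzers-common 5.5.0 against lucene-core 6.4.1). The attached ivy.xml.patch is not shown here; the following is only a hedged sketch of what aligning the two artifacts on one version could look like in a plugin's ivy.xml, with the version taken from the description:

```xml
<!-- Sketch only (actual patch may differ): keep all Lucene artifacts on one
     version so CharArraySet resolves to a single class across jars. -->
<dependency org="org.apache.lucene" name="lucene-core" rev="5.5.0"/>
<dependency org="org.apache.lucene" name="lucene-analyzers-common" rev="5.5.0"/>
```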
[jira] [Updated] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2489: Fix Version/s: 1.15 > Dependency collision with lucene-analyzers-common in scoring-similarity plugin > -- > > Key: NUTCH-2489 > URL: https://issues.apache.org/jira/browse/NUTCH-2489 > Project: Nutch > Issue Type: Bug > Components: scoring >Affects Versions: 1.14 >Reporter: Yossi Tamari >Priority: Major > Fix For: 1.15 > > Attachments: ivy.xml.patch > > > After updating to Master branch of 1.14, we get a few compile errors in > LuceneTokenizer.java and LuceneAnalyzerUtil.java: > {code:java} > Type mismatch: cannot convert from org.apache.lucene.analysis.CharArraySet to > org.apache.lucene.analysis.util.CharArraySet > {code} > This seems to be caused by the fact that scoring-similarity compiles with > lucene-analyzers-common-5.5.0.jar (from ivy.xml), but with lucene-core-6.4.1 > instead of the matching 5.5.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2508. - Resolution: Fixed Thank you [~mfeltscher] > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2508: Fix Version/s: 1.15 > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
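Since, as clarified above, only the host name is checked against http.proxy.exception.list, a configuration entry should list hosts rather than URL or domain patterns. A minimal nutch-site.xml sketch; the host names are placeholders, and the exact value syntax should be checked against the corrected property description:

```xml
<!-- nutch-site.xml sketch: entries are matched against the host name only,
     not against full URLs or domain patterns (example hosts are placeholders) -->
<property>
  <name>http.proxy.exception.list</name>
  <value>intranet.example.org,files.example.org</value>
</property>
```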
[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341326#comment-16341326 ] Lewis John McGibbney commented on NUTCH-2369: - Hi [~markus17] the idea here was to export full graph information into something that could be interpreted by [Tinkerpop|http://tinkerpop.apache.org] and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html]. > Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph > -- > > Key: NUTCH-2369 > URL: https://issues.apache.org/jira/browse/NUTCH-2369 > Project: Nutch > Issue Type: Task > Components: crawldb, graphgenerator, hostdb, linkdb, segment, > storage, tool > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Labels: gsoc2017, gsoc2018 > Fix For: 1.15 > > > I've been thinking for quite some time now that a new Tool which writes Nutch > data out as full graph data would be an excellent addition to the codebase. > My thoughts involve writing data using Tinkerpop's ScriptInputFormat and > ScriptOutputFormat to create Vertex objects representing Nutch Crawl > Records. > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html > I envisage that each Vertex object would require the CrawlDB, LinkDB, a > Segment and possibly the HostDB in order to be fully populated. Graph > characteristics, e.g. Edges, would come from those existing data structures > as well. > It is my intention to propose this as a GSoC project for 2017 and I have > already talked offline with a potential student [~omkar20895] about him > participating as the student. > Essentially, if we were able to create a Graph enabling true traversal, this > could be a game changer for how Nutch Crawl data is interpreted. 
It is my > feeling that this issue most likely also involves an entire upgrade of the > Hadoop APIs from mapred to mapreduce for the master codebase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2369: Labels: gsoc2017 gsoc2018 (was: gsoc2017) > Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph > -- > > Key: NUTCH-2369 > URL: https://issues.apache.org/jira/browse/NUTCH-2369 > Project: Nutch > Issue Type: Task > Components: crawldb, graphgenerator, hostdb, linkdb, segment, > storage, tool > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Major > Labels: gsoc2017, gsoc2018 > Fix For: 1.15 > > > I've been thinking for quite some time now that a new Tool which writes Nutch > data out as full graph data would be an excellent addition to the codebase. > My thoughts involve writing data using Tinkerpop's ScriptInputFormat and > ScriptOutputFormat to create Vertex objects representing Nutch Crawl > Records. > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html > I envisage that each Vertex object would require the CrawlDB, LinkDB, a > Segment and possibly the HostDB in order to be fully populated. Graph > characteristics, e.g. Edges, would come from those existing data structures > as well. > It is my intention to propose this as a GSoC project for 2017 and I have > already talked offline with a potential student [~omkar20895] about him > participating as the student. > Essentially, if we were able to create a Graph enabling true traversal, this > could be a game changer for how Nutch Crawl data is interpreted. It is my > feeling that this issue most likely also involves an entire upgrade of the > Hadoop APIs from mapred to mapreduce for the master codebase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering
[ https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2502. - Resolution: Fixed Thank you [~mfeltscher] > Any23 Plugin: Add Content-Type filtering > > > Key: NUTCH-2502 > URL: https://issues.apache.org/jira/browse/NUTCH-2502 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > It should be possible to filter based on a document's Content-Type when using > Any23 extractors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering
[ https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2502: Fix Version/s: 1.15 > Any23 Plugin: Add Content-Type filtering > > > Key: NUTCH-2502 > URL: https://issues.apache.org/jira/browse/NUTCH-2502 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > It should be possible to filter based on a document's Content-Type when using > Any23 extractors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2499: Fix Version/s: 1.15 > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > Due to a change in > https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a > the Elastic REST indexer does not work with HashSets for values anymore but > instead saves duplicated values as arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2499. - Resolution: Fixed Thank you [~mfeltscher] > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher > Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.15 > > > Due to a change in > https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a > the Elastic REST indexer does not work with HashSets for values anymore but > instead saves duplicated values as arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
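The regression described above is that the indexer stopped funnelling multi-valued fields through a HashSet, so duplicate values end up serialized in the arrays it sends. A small illustrative sketch of the set-based deduplication (this is not the actual Elastic REST indexer code; class and method names are my own):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative sketch (not the actual indexer code): passing field values
// through a LinkedHashSet collapses repeats to a single value while keeping
// first-seen order, which a plain list-to-array conversion does not.
public class FieldDedup {
    public static List<String> dedup(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("nutch", "crawler", "nutch", "apache");
        System.out.println(dedup(raw)); // [nutch, crawler, apache]
    }
}
```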
[jira] [Resolved] (NUTCH-2503) Add option to run tests for a single plugin
[ https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2503. - Resolution: Fixed Thank you [~mfeltscher] > Add option to run tests for a single plugin > --- > > Key: NUTCH-2503 > URL: https://issues.apache.org/jira/browse/NUTCH-2503 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > Sometimes it makes sense to just run tests for a single plugin instead of > building all plugins and running all tests at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2503) Add option to run tests for a single plugin
[ https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2503: Fix Version/s: 1.15 > Add option to run tests for a single plugin > --- > > Key: NUTCH-2503 > URL: https://issues.apache.org/jira/browse/NUTCH-2503 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > Sometimes it makes sense to just run tests for a single plugin instead of > building all plugins and running all tests at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts
[ https://issues.apache.org/jira/browse/NUTCH-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2497. - Resolution: Fixed Thank you [~mfeltscher] > Elastic REST Indexer: Allow multiple hosts > -- > > Key: NUTCH-2497 > URL: https://issues.apache.org/jira/browse/NUTCH-2497 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > Allow specifying a list of Elasticsearch hosts to index documents to. This > would be especially helpful when working with an Elasticsearch cluster which > consists of multiple nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts
[ https://issues.apache.org/jira/browse/NUTCH-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2497: Fix Version/s: 1.15 > Elastic REST Indexer: Allow multiple hosts > -- > > Key: NUTCH-2497 > URL: https://issues.apache.org/jira/browse/NUTCH-2497 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > Allow specifying a list of Elasticsearch hosts to index documents to. This > would be especially helpful when working with an Elasticsearch cluster which > consists of multiple nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2461) Generate passes the data to when maxCount == 0
[ https://issues.apache.org/jira/browse/NUTCH-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2461. - Resolution: Fixed Thank you [~semyon.semyo...@mail.com] > Generate passes the data to when maxCount == 0 > --- > > Key: NUTCH-2461 > URL: https://issues.apache.org/jira/browse/NUTCH-2461 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.14 >Reporter: Semyon Semyonov >Priority: Critical > Fix For: 1.15 > > > The generator checks the condition > if (maxCount > 0) (line 421) and stops generation when the amount per host > exceeds maxCount (continue, line 455), > but when maxCount == 0 it goes directly to line 465: output.collect(key, > entry); > This is obviously not correct; the correct solution would be to add > if (maxCount == 0) { > continue; > } > at line 380. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
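The proposed guard can be modeled in isolation. Below is a simplified simulation of the per-host cap, not the actual Generator code (the real logic lives in org.apache.nutch.crawl.Generator at the line numbers cited above; class and method names here are hypothetical): with the guard in place, maxCount == 0 selects nothing instead of bypassing the cap entirely, while a negative maxCount still means "no limit".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the Generator's per-host cap (names hypothetical).
public class HostCap {
    public static List<String> select(List<String> urlsByHost, int maxCount) {
        List<String> out = new ArrayList<>();
        if (maxCount == 0) {
            return out;                   // proposed guard: emit nothing
        }
        Map<String, Integer> perHost = new HashMap<>();
        for (String host : urlsByHost) {
            int seen = perHost.merge(host, 1, Integer::sum);
            if (maxCount > 0 && seen > maxCount) {
                continue;                 // over the cap: skip (line 455 analogue)
            }
            out.add(host);                // output.collect analogue (line 465)
        }
        return out;
    }

    public static void main(String[] args) {
        // Cap of 2 per host: the third "a" is skipped.
        System.out.println(select(List.of("a", "a", "a", "b"), 2)); // [a, a, b]
    }
}
```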
[jira] [Commented] (NUTCH-2321) Indexing filter checker leaks threads
[ https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326453#comment-16326453 ] Lewis John McGibbney commented on NUTCH-2321: - Thank you [~jurian] > Indexing filter checker leaks threads > - > > Key: NUTCH-2321 > URL: https://issues.apache.org/jira/browse/NUTCH-2321 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2321.patch > > > Same issue as NUTCH-2320. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2321) Indexing filter checker leaks threads
[ https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2321. - Resolution: Fixed > Indexing filter checker leaks threads > - > > Key: NUTCH-2321 > URL: https://issues.apache.org/jira/browse/NUTCH-2321 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2321.patch > > > Same issue as NUTCH-2320.
[jira] [Updated] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1129: Fix Version/s: (was: 2.5) 1.15 > Any23 Nutch plugin > -- > > Key: NUTCH-1129 > URL: https://issues.apache.org/jira/browse/NUTCH-1129 > Project: Nutch > Issue Type: New Feature > Components: parser > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-1129.patch > > > This plugin should build on the Any23 library to provide us with a plugin > which extracts RDF data from HTTP and file resources. Although as of writing > Any23 is not part of the ASF, the project is working towards integration into > the Apache Incubator. Once the project proves its value, this would be an > excellent addition to the Nutch 1.X codebase. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1129. - Resolution: Fixed Thank you [~mfeltscher] this is great > Any23 Nutch plugin > -- > > Key: NUTCH-1129 > URL: https://issues.apache.org/jira/browse/NUTCH-1129 > Project: Nutch > Issue Type: New Feature > Components: parser > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-1129.patch > > > This plugin should build on the Any23 library to provide us with a plugin > which extracts RDF data from HTTP and file resources. Although as of writing > Any23 is not part of the ASF, the project is working towards integration into > the Apache Incubator. Once the project proves its value, this would be an > excellent addition to the Nutch 1.X codebase.
[jira] [Resolved] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script
[ https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2493. - Resolution: Fixed Thank you [~mfeltscher] > Add configuration parameter for sitemap processing to crawler script > > > Key: NUTCH-2493 > URL: https://issues.apache.org/jira/browse/NUTCH-2493 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher > Fix For: 1.15 > > > While using the crawler script with the sitemap processing feature introduced > in NUTCH-2491 I encountered some performance issues when working with large > sitemaps. > Therefore one should be able to specify whether sitemap processing based on HostDB > should take place and, if so, how frequently it should be done.
[jira] [Updated] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script
[ https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2493: Fix Version/s: 1.15 > Add configuration parameter for sitemap processing to crawler script > > > Key: NUTCH-2493 > URL: https://issues.apache.org/jira/browse/NUTCH-2493 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher > Fix For: 1.15 > > > While using the crawler script with the sitemap processing feature introduced > in NUTCH-2491 I encountered some performance issues when working with large > sitemaps. > Therefore one should be able to specify whether sitemap processing based on HostDB > should take place and, if so, how frequently it should be done.
[jira] [Resolved] (NUTCH-2324) Issue in setting default linkdb path
[ https://issues.apache.org/jira/browse/NUTCH-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2324. - Resolution: Fixed Thank you [~sachin] > Issue in setting default linkdb path > - > > Key: NUTCH-2324 > URL: https://issues.apache.org/jira/browse/NUTCH-2324 > Project: Nutch > Issue Type: Bug > Components: REST_api >Affects Versions: 1.12 >Reporter: Sachin >Priority: Minor > Fix For: 1.15 > > > There is an extra if condition that prevents setting the default linkdb path if > we don't provide one in the REST call. > > Check this : > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexingJob.java#L272 > https://github.com/apache/nutch/pull/153 > PS : Don't know whether it is intentional. You may check!
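The intended behaviour can be sketched as a simple fall-back: use the linkdb path supplied in the REST call when present, otherwise derive a default. This is a hypothetical illustration only; the class, method, and default layout below are invented and are not the actual IndexingJob code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinkDbPathSketch {
    // Hypothetical defaulting logic: prefer the caller-supplied path,
    // otherwise fall back to a default location under the crawl directory.
    // The reported bug was an extra condition that prevented this fall-back
    // from ever being taken.
    static Path resolveLinkDb(String crawlDir, String provided) {
        if (provided != null && !provided.isEmpty()) {
            return Paths.get(provided); // explicit path from the REST call
        }
        return Paths.get(crawlDir, "linkdb"); // assumed default layout
    }
}
```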
[jira] [Updated] (NUTCH-2324) Issue in setting default linkdb path
[ https://issues.apache.org/jira/browse/NUTCH-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2324: Fix Version/s: 1.15 > Issue in setting default linkdb path > - > > Key: NUTCH-2324 > URL: https://issues.apache.org/jira/browse/NUTCH-2324 > Project: Nutch > Issue Type: Bug > Components: REST_api >Affects Versions: 1.12 >Reporter: Sachin >Priority: Minor > Fix For: 1.15 > > > There is an extra if condition that prevents setting the default linkdb path if > we don't provide one in the REST call. > > Check this : > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexingJob.java#L272 > https://github.com/apache/nutch/pull/153 > PS : Don't know whether it is intentional. You may check!
[jira] [Resolved] (NUTCH-2492) Add more configuration parameters to crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2492. - Resolution: Fixed Thank you [~mfeltscher] > Add more configuration parameters to crawl script > -- > > Key: NUTCH-2492 > URL: https://issues.apache.org/jira/browse/NUTCH-2492 > Project: Nutch > Issue Type: New Feature >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher > Fix For: 1.15 > > > Instead of having to copy and adjust the crawl script in order to specify the > following configuration options, allow the user to pass them in using > arguments: > - numSlaves > - numTasks > - sizeFetchlist > - timeLimitFetch > - numThreads