[jira] [Commented] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554866#comment-17554866 ] Lewis John McGibbney commented on NUTCH-2940: WIP PR available at

[jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working

2022-06-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554865#comment-17554865 ] ASF GitHub Bot commented on NUTCH-2490: lewismc commented on PR #735: URL:

[GitHub] [nutch] lewismc commented on pull request #735: NUTCH-2490 Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread GitBox
lewismc commented on PR #735: URL: https://github.com/apache/nutch/pull/735#issuecomment-1157159803 I'll squash and merge the commits into one when we are ready to merge into the `master` branch.

[GitHub] [nutch] lewismc opened a new pull request, #735: Nutch 2940

2022-06-15 Thread GitBox
lewismc opened a new pull request, #735: URL: https://github.com/apache/nutch/pull/735 This is a WIP for https://issues.apache.org/jira/browse/NUTCH-2940. The work was conducted by @AzureTriple @imanzanganeh @jbsimmon @LilyPerr and @Lirongxuan1 from the 2022 USC Senior CS Capstone Program.

[jira] [Assigned] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2940: Assignee: Lewis John McGibbney > Develop Gradle Core Build for Apache Nutch

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554859#comment-17554859 ] ASF GitHub Bot commented on NUTCH-2936: lewismc commented on PR #733: URL:

[GitHub] [nutch] lewismc commented on pull request #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode

2022-06-15 Thread GitBox
lewismc commented on PR #733: URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313 This is exciting!!! Excellent debugging ... you got further than me. I can't get around to testing it until next week at the earliest. Thinking back, I did observe revisits (recursive

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554702#comment-17554702 ] Sebastian Nagel commented on NUTCH-2936: Update: the issue is reproducible also in local mode

[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554701#comment-17554701 ] ASF GitHub Bot commented on NUTCH-2952: sebastian-nagel commented on PR #734: URL:

[GitHub] [nutch] sebastian-nagel commented on pull request #734: NUTCH-2952 Upgrade core dependencies

2022-06-15 Thread GitBox
sebastian-nagel commented on PR #734: URL: https://github.com/apache/nutch/pull/734#issuecomment-1156706013 Update: the failing unit test (TestCrawlDbDeduplication) on my development system stems from a modified nutch-site.xml requesting protocol-okhttp - obviously, it's the combination of

[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554663#comment-17554663 ] ASF GitHub Bot commented on NUTCH-2952: sebastian-nagel opened a new pull request, #734: URL:

[GitHub] [nutch] sebastian-nagel opened a new pull request, #734: NUTCH-2952 Upgrade core dependencies

2022-06-15 Thread GitBox
sebastian-nagel opened a new pull request, #734: URL: https://github.com/apache/nutch/pull/734 Upgrade of core dependencies - Hadoop 3.1.3 -> 3.3.3 - log4j 2.17.0 -> 2.17.2 - and some more. Note: I've observed that some unit tests are failing with the same or similar errors as

[jira] [Assigned] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2952: Assignee: Sebastian Nagel > Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

[jira] [Created] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2952: Summary: Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2) Key: NUTCH-2952 URL: https://issues.apache.org/jira/browse/NUTCH-2952 Project: Nutch

[jira] [Commented] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554564#comment-17554564 ] Sebastian Nagel commented on NUTCH-2949: This is addressed in

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554563#comment-17554563 ] ASF GitHub Bot commented on NUTCH-2936: sebastian-nagel opened a new pull request, #733: URL:

[GitHub] [nutch] sebastian-nagel opened a new pull request, #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode

2022-06-15 Thread GitBox
sebastian-nagel opened a new pull request, #733: URL: https://github.com/apache/nutch/pull/733 Fixes to address failing Nutch jobs in (pseudo-)distributed mode. Implements: - caching of URLStreamHandlers per protocol so that handlers are not created anew - enforce
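
The per-protocol caching mentioned in the PR #733 description might look roughly like the sketch below. This is not the code from the pull request: the class names `CachingHandlerFactory` and `NoopHandler` and the `creator` function are hypothetical and only illustrate the idea with standard JDK classes (`URLStreamHandlerFactory`, `ConcurrentHashMap`).

```java
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch: hands out one cached URLStreamHandler per protocol. */
class CachingHandlerFactory implements URLStreamHandlerFactory {
  // Handlers already created, keyed by protocol name (e.g. "http").
  private final ConcurrentMap<String, URLStreamHandler> cache = new ConcurrentHashMap<>();
  // Assumed stand-in for whatever actually builds a handler (e.g. a plugin lookup).
  private final Function<String, URLStreamHandler> creator;

  CachingHandlerFactory(Function<String, URLStreamHandler> creator) {
    this.creator = creator;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // computeIfAbsent stores the first non-null handler built for a protocol and
    // reuses it afterwards. A null result is not cached and is returned as-is,
    // which lets the JVM fall back to its built-in handlers.
    return cache.computeIfAbsent(protocol, creator);
  }
}

/** Hypothetical placeholder handler, used only to exercise the factory. */
class NoopHandler extends URLStreamHandler {
  @Override
  protected URLConnection openConnection(URL u) {
    return null; // a real handler would open a connection here
  }
}
```

With this sketch, repeated calls to `createURLStreamHandler` for the same protocol return the same handler instance, so a slow `creator` runs only once per protocol that it supports. The `creator` must not call back into the same factory, since `computeIfAbsent` does not allow the mapping function to modify the map.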

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554531#comment-17554531 ] Sebastian Nagel commented on NUTCH-2936: After debugging this: the call by the Hadoop MR Job to