[jira] [Updated] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2936: --- Priority: Blocker (was: Major) > Early registration of URL stream handlers provided by

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476615#comment-17476615 ] Sebastian Nagel commented on NUTCH-2936: Using protocol-okhttp causes parsechecker to raise the

[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476606#comment-17476606 ] ASF GitHub Bot commented on NUTCH-2573: --- sebastian-nagel opened a new pull request #724: URL:

[GitHub] [nutch] sebastian-nagel opened a new pull request #724: NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-15 Thread GitBox
sebastian-nagel opened a new pull request #724: URL: https://github.com/apache/nutch/pull/724 - add properties - `http.robots.503.defer.visits` : enable/disable the feature (default: enabled) - `http.robots.503.defer.visits.delay` : delay to wait before the next

[jira] [Created] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2937: -- Summary: parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode Key: NUTCH-2937 URL: https://issues.apache.org/jira/browse/NUTCH-2937

[jira] [Commented] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.0 and Any23 2.6

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476614#comment-17476614 ] ASF GitHub Bot commented on NUTCH-2919: --- sebastian-nagel commented on pull request #717: URL:

[GitHub] [nutch] sebastian-nagel commented on pull request #717: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread GitBox
sebastian-nagel commented on pull request #717: URL: https://github.com/apache/nutch/pull/717#issuecomment-1013689493 Ok, the dependency conflict with commons-io is tracked in NUTCH-2937. -- This is an automated message from the Apache Git Service. To respond to the message, please log

[jira] [Updated] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2919: Summary: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (was: NUTCH-2919 Upgrade

[jira] [Commented] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476708#comment-17476708 ] ASF GitHub Bot commented on NUTCH-2919: --- lewismc merged pull request #717: URL:

[GitHub] [nutch] lewismc commented on pull request #717: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread GitBox
lewismc commented on pull request #717: URL: https://github.com/apache/nutch/pull/717#issuecomment-1013770346 Thanks @sebastian-nagel

[GitHub] [nutch] lewismc merged pull request #717: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread GitBox
lewismc merged pull request #717: URL: https://github.com/apache/nutch/pull/717

[jira] [Commented] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476709#comment-17476709 ] ASF GitHub Bot commented on NUTCH-2919: --- lewismc commented on pull request #717: URL:

[jira] [Resolved] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2919. - Resolution: Fixed > NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 >

[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476712#comment-17476712 ] ASF GitHub Bot commented on NUTCH-2573: --- lewismc commented on a change in pull request #724: URL:

[GitHub] [nutch] lewismc commented on a change in pull request #724: NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-15 Thread GitBox
lewismc commented on a change in pull request #724: URL: https://github.com/apache/nutch/pull/724#discussion_r785373191 ## File path: src/java/org/apache/nutch/fetcher/FetchItemQueues.java ## @@ -195,11 +195,15 @@ public synchronized FetchItem getFetchItem() { return

[jira] [Created] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2022-01-15 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-2938: --- Summary: Use Any23's RepositoryWriter to write structured data to Rdf4j repository Key: NUTCH-2938 URL: https://issues.apache.org/jira/browse/NUTCH-2938

[jira] [Commented] (NUTCH-2919) NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

2022-01-15 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476720#comment-17476720 ] Hudson commented on NUTCH-2919: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #69 (See

[GitHub] [nutch] lewismc opened a new pull request #725: NUTCH-2938 Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2022-01-15 Thread GitBox
lewismc opened a new pull request #725: URL: https://github.com/apache/nutch/pull/725 PR addresses https://issues.apache.org/jira/browse/NUTCH-2938 We could improve the performance of this plugin if we could reuse the repository connection however I am not entirely sure how to do that

[jira] [Assigned] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2936: --- Assignee: Lewis John McGibbney > Early registration of URL stream handlers

[jira] [Work started] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2936 started by Lewis John McGibbney. --- > Early registration of URL stream handlers provided by plugins

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476719#comment-17476719 ] Lewis John McGibbney commented on NUTCH-2936: - I'll try to reproduce. Thanks > Early

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476730#comment-17476730 ] Lewis John McGibbney commented on NUTCH-2936: - I can reproduce this. Although I was planning

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476734#comment-17476734 ] ASF GitHub Bot commented on NUTCH-2936: --- lewismc opened a new pull request #726: URL:

[jira] [Commented] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2022-01-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476724#comment-17476724 ] ASF GitHub Bot commented on NUTCH-2938: --- lewismc opened a new pull request #725: URL:

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476728#comment-17476728 ] Lewis John McGibbney commented on NUTCH-2936: - [~snagel] which JDK are you using? > Early

[GitHub] [nutch] lewismc opened a new pull request #726: NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread GitBox
lewismc opened a new pull request #726: URL: https://github.com/apache/nutch/pull/726 I ended up producing this PR as a result of investigating NUTCH-2936. This PR does not fix NUTCH-2936. The problem is that the