[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769299#comment-17769299 ] ASF GitHub Bot commented on NUTCH-2959: --- sebastian-nagel commented on PR #776: URL:

[jira] [Created] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-26 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3006: -- Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika) Key: NUTCH-3006 URL: https://issues.apache.org/jira/browse/NUTCH-3006 Project: Nutch

[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1736008780 > I suggest that we downgrade to Tika 2.2.1 to fix that regression. Good point, @lewismc. I've opened NUTCH-3006 for that. -- This is an automated message from the Apache

[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769293#comment-17769293 ] ASF GitHub Bot commented on NUTCH-2990: --- sebastian-nagel commented on PR #779: URL:

[GitHub] [nutch] sebastian-nagel commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
sebastian-nagel commented on PR #779: URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193 > an example on hand of a robots.txt which can be fetched with >1 redirects? http://wikipedia.org/robots.txt Note: works with protocol-http, for protocol-okhttp need

Establishing a Nutch development roadmap

2023-09-26 Thread lewis john mcgibbney
Hi dev@, I've been at arm's length for a while as $dayjob changed and then changed again over the last number of years. With that being said, I wanted to start a thread on $title with the goal of establishing some "big items" we could put on the roadmap and maybe even publish... Here are some of

[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769255#comment-17769255 ] ASF GitHub Bot commented on NUTCH-2990: --- lewismc commented on PR #779: URL:

[GitHub] [nutch] lewismc commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
lewismc commented on PR #779: URL: https://github.com/apache/nutch/pull/779#issuecomment-1735761972 Very nice @sebastian-nagel Do you have an example on hand of a robots.txt which can be fetched with >1 redirects?

[jira] [Created] (NUTCH-3005) Upgrade selenium as needed

2023-09-26 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3005: -- Summary: Upgrade selenium as needed Key: NUTCH-3005 URL: https://issues.apache.org/jira/browse/NUTCH-3005 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769168#comment-17769168 ] Hudson commented on NUTCH-3004: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #115 (See

Jenkins build is back to normal : Nutch » Nutch-trunk #115

2023-09-26 Thread Apache Jenkins Server
See

Build failed in Jenkins: Nutch » Nutch-trunk #114

2023-09-26 Thread Apache Jenkins Server
See Changes: -- Started by an SCM change Running as SYSTEM [EnvInject] - Loading node environment variables. Building remotely on builds58 (ubuntu) in workspace

Build failed in Jenkins: Nutch » Nutch-trunk #113

2023-09-26 Thread Apache Jenkins Server
See Changes: -- Started by an SCM change Running as SYSTEM [EnvInject] - Loading node environment variables. Building remotely on builds58 (ubuntu) in workspace

[jira] [Resolved] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved NUTCH-3004. Resolution: Fixed > Avoid NPE in HttpResponse > - > > Key:

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769097#comment-17769097 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1735261857 Converting this to draft until Hadoop 3.4.0 is released.

[jira] [Commented] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769096#comment-17769096 ] ASF GitHub Bot commented on NUTCH-3004: --- tballison merged PR #778: URL:

[GitHub] [nutch] tballison merged pull request #778: NUTCH-3004

2023-09-26 Thread via GitHub
tballison merged PR #778: URL: https://github.com/apache/nutch/pull/778

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Fix Version/s: 1.20 > Avoid NPE in HttpResponse > - > >

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Component/s: plugin protocol > Avoid NPE in HttpResponse >

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Affects Version/s: 1.19 > Avoid NPE in HttpResponse > - > >

[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769045#comment-17769045 ] ASF GitHub Bot commented on NUTCH-2990: --- sebastian-nagel opened a new pull request, #779: URL:

[GitHub] [nutch] sebastian-nagel opened a new pull request, #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
sebastian-nagel opened a new pull request, #779: URL: https://github.com/apache/nutch/pull/779 - follow multiple redirects when fetching robots.txt - number of followed redirects is configurable by the property `http.robots.redirect.max` (default: 5) - improvements in
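
The `http.robots.redirect.max` property named in the pull request description could presumably be overridden like any other Nutch property. A minimal sketch, assuming the override goes in the usual `conf/nutch-site.xml` file; the value 3 is only illustrative, and the description text is a paraphrase rather than the wording used in the PR:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: placed in conf/nutch-site.xml, which overrides nutch-default.xml. -->
  <property>
    <name>http.robots.redirect.max</name>
    <!-- Illustrative value; the PR description states the default is 5. -->
    <value>3</value>
    <description>Maximum number of redirects followed when fetching robots.txt
    (paraphrased, not the PR's own description).</description>
  </property>
</configuration>
```

Leaving the property unset should keep the default of 5 stated in the PR description.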