[jira] [Resolved] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved NUTCH-3000. Fix Version/s: 1.20 Resolution: Fixed > protocol-selenium returns only the body,strips off

[jira] [Resolved] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved NUTCH-3001. Fix Version/s: 1.20 Resolution: Fixed > protocol-selenium requires Content-Type header >

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764803#comment-17764803 ] Hudson commented on NUTCH-3000: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #110 (See

[jira] [Commented] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764802#comment-17764802 ] Hudson commented on NUTCH-3001: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #110 (See

[jira] [Commented] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764792#comment-17764792 ] ASF GitHub Bot commented on NUTCH-3001: --- tballison merged PR #774: URL:

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764791#comment-17764791 ] ASF GitHub Bot commented on NUTCH-3000: --- tballison merged PR #773: URL:

[GitHub] [nutch] tballison merged pull request #774: NUTCH-3001 - fix logic for grabbing bytes if there's no content type …

2023-09-13 Thread via GitHub
tballison merged PR #774: URL: https://github.com/apache/nutch/pull/774 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

[GitHub] [nutch] tballison merged pull request #773: NUTCH-3000 - the selenium protocol should return the full html, not just the inner body

2023-09-13 Thread via GitHub
tballison merged PR #773: URL: https://github.com/apache/nutch/pull/773

Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-13 Thread Sebastian Nagel
+1 Since any23 also depends on tika-core, the plugin is likely to break if we upgrade to a more recent Tika version in Nutch core and the parse-tika plugin. ~Sebastian On 9/13/23 16:50, Tim Allison wrote: All, I opened https://issues.apache.org/jira/browse/NUTCH-2998

[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764741#comment-17764741 ] Tim Allison commented on NUTCH-2998: Sorry, I botched the title in the PR:

[GitHub] [nutch] tballison commented on pull request #775: Remove Any23 from Nutch

2023-09-13 Thread via GitHub
tballison commented on PR #775: URL: https://github.com/apache/nutch/pull/775#issuecomment-1717820655 When I build this, I get this harmless (?) warning in `src/plugin/logs/hadoop.log`: ``` 2023-02-24 10:07:39,218 WARN o.a.n.p.PluginManifestParser [main] Error while loading

[GitHub] [nutch] tballison opened a new pull request, #775: Remove Any23 from Nutch

2023-09-13 Thread via GitHub
tballison opened a new pull request, #775: URL: https://github.com/apache/nutch/pull/775 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch

[DISCUSS] Removing Any23 from Nutch?

2023-09-13 Thread Tim Allison
All, I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks ago. Any23 was moved to the attic in June. Unless there are objections, I propose removing it from Nutch before the next release. Any objections? Best, Tim

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764722#comment-17764722 ] ASF GitHub Bot commented on NUTCH-2978: --- tballison commented on PR #772: URL:

[GitHub] [nutch] tballison commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-13 Thread via GitHub
tballison commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1717765669 If folks could test this out on their workloads, that'd be fantastic! It works on mine, but I'm really hesitant to merge until someone else runs it. Thank you!

[jira] [Commented] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764720#comment-17764720 ] ASF GitHub Bot commented on NUTCH-3001: --- tballison opened a new pull request, #774: URL:

[GitHub] [nutch] tballison opened a new pull request, #774: NUTCH-3001 - fix logic for grabbing bytes if there's no content type …

2023-09-13 Thread via GitHub
tballison opened a new pull request, #774: URL: https://github.com/apache/nutch/pull/774 …in the header

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764718#comment-17764718 ] ASF GitHub Bot commented on NUTCH-3000: --- tballison opened a new pull request, #773: URL:

[GitHub] [nutch] tballison opened a new pull request, #773: NUTCH-3000 - the selenium protocol should return the full html, not just the inner body

2023-09-13 Thread via GitHub
tballison opened a new pull request, #773: URL: https://github.com/apache/nutch/pull/773 …ust the inner body element.

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764705#comment-17764705 ] Tim Allison commented on NUTCH-2978: I haven't tested in hadoop. I've just run it locally, and, for

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764699#comment-17764699 ] Markus Jelsma commented on NUTCH-2978: -- You managed to get it up and running, as well when deployed

[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated NUTCH-3001: --- Description: It looks like the selenium protocol requires that there be a content-type header.

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764697#comment-17764697 ] Markus Jelsma commented on NUTCH-3000: -- Yes, this is a bit odd indeed. +1 > protocol-selenium

[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated NUTCH-3001: --- Priority: Minor (was: Major) > protocol-selenium requires Content-Type header >

[jira] [Commented] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764698#comment-17764698 ] Tim Allison commented on NUTCH-3001: Or is the notion that if the selenium protocol doesn't pull any

[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated NUTCH-3001: --- Description: It looks like the selenium protocol requires that there be a content-type header.

[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated NUTCH-3001: --- Description: It looks like the selenium protocol requires that there be content-type. The logic

[jira] [Created] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3001: -- Summary: protocol-selenium requires Content-Type header Key: NUTCH-3001 URL: https://issues.apache.org/jira/browse/NUTCH-3001 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764692#comment-17764692 ] Sebastian Nagel commented on NUTCH-3000: +1 Yes, the full HTML seems the best choice for the

[jira] [Created] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3000: -- Summary: protocol-selenium returns only the body,strips off the element Key: NUTCH-3000 URL: https://issues.apache.org/jira/browse/NUTCH-3000 Project: Nutch

[jira] [Comment Edited] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel edited comment on NUTCH-2998 at 9/13/23 1:26 PM: - +1 >

[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel commented on NUTCH-2998: +1 > Remove the Any23 plugin > ---