[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770553#comment-17770553 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770552#comment-17770552 ] ASF GitHub Bot commented on NUTCH-2959: --- sebastian-nagel commented on PR #776: URL:

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770540#comment-17770540 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741263080 Alright, the only thing that I think _might_ work is Tika shading commons-io in tika-app, and then Nutch uses tika-app instead of the individual parser-modules etc. for parser-tika.

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770502#comment-17770502 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741143346 Stepping away from the keyboard. :sob:

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770500#comment-17770500 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770493#comment-17770493 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741125973 There is just no winning... We just upgraded POI to 5.2.4, and it uses a bunch of the newer commons-io methods. If we downgrade POI to 5.2.3, we get a clean build of Tika with

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770481#comment-17770481 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741061515 I reverted to 2.2.1, and that's not far enough back -- there were 222 parse failures, many with the wrap problem. I reverted to 2.0.0, and then had 85 parse failures again. This

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770477#comment-17770477 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741040471 >see the comments in [test_tika_parser.sh](https://github.com/sebastian-nagel/nutch-test-single-node-cluster/blob/master/test_tika_parser.sh) Sorry! Yep, saw that too late.

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770476#comment-17770476 ] ASF GitHub Bot commented on NUTCH-2959: --- tballison commented on PR #776: URL:

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741039619 With the update to Tika 2.9.1-SNAPSHOT, I get 85 failed parses; most of them are either encrypted documents or "can't retrieve Tika Parser for x"

Re: Establishing a Nutch development roadmap

2023-09-29 Thread Sebastian Nagel
> Migrating javax->jakarta has been quite a chore on Tika because of dependencies. Given back-compat issues with hadoop, is this even on the horizon for Nutch?
Good point. I think we are pretty free to replace javax packages in Nutch core and plugins - they're used in multiple classes. If

[jira] [Commented] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-29 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770320#comment-17770320 ] Sebastian Nagel commented on NUTCH-3006: > revert CloseShieldInputStream.wrap(), which I think
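
The "wrap problem" discussed in this thread concerns `CloseShieldInputStream.wrap()`, which the commons-io version found at runtime may not provide. As a minimal sketch (not part of Nutch or Tika), the following probe reports which jar a class was loaded from and whether a given public static method exists. The class name `CommonsIoProbe` and both helper methods are hypothetical names chosen for illustration; only standard JDK reflection calls are used, and the probe simply reports `false` if commons-io is absent.

```java
import java.io.InputStream;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Nutch, Tika, or commons-io.
public class CommonsIoProbe {

    /** True if the class can be loaded and has a public static method with this signature. */
    public static boolean hasPublicStaticMethod(String className, String methodName,
                                                Class<?>... paramTypes) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, CommonsIoProbe.class.getClassLoader());
            Method m = cls.getMethod(methodName, paramTypes); // public methods only
            return Modifier.isStatic(m.getModifiers());
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Location (usually a jar URL) the class was loaded from, or a short explanation. */
    public static String loadedFrom(String className) {
        try {
            Class<?> cls = Class.forName(className, false, CommonsIoProbe.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "unknown (no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        String cls = "org.apache.commons.io.input.CloseShieldInputStream";
        System.out.println(cls + " loaded from: " + loadedFrom(cls));
        System.out.println("wrap(InputStream) available: "
                + hasPublicStaticMethod(cls, "wrap", InputStream.class));
    }
}
```

Running `main` with the same classpath as the failing parse job would show whether an older commons-io jar is shadowing the one Tika was built against; how that classpath is assembled on a given cluster is outside what this sketch covers.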

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770318#comment-17770318 ] ASF GitHub Bot commented on NUTCH-2959: --- sebastian-nagel commented on PR #776: URL:

[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1740397046
> what do I use for the tika seeds file? Are you using our github repo, or the tika-parsers-common package specifically
see the comments in