Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]
sandeshkr419 commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2007895625 Hi @tballison & @gastaldi, I am trying to upgrade `tika-parsers-standard-package` from 2.6 to 2.8/2.9 because, after updating `commons-compress` to 1.25.0, I ran into issues parsing `IWorkPackageParser`-related files (.pages, .key). More details are here: https://github.com/opensearch-project/OpenSearch/pull/12627 When I bump the tika dependencies to 2.8.0, 2.9.0, or 2.9.1, I can no longer use the parsers that were part of the `tika-parsers-standard-package` jar, such as `HtmlParser` and the others listed [here](https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java#L92). Given the changes to the package structure of `tika-parsers-standard-package` since 2.7.0, has the way these dependencies are consumed changed as well? Is there any documentation I can refer to on how to consume the parser implementations that are no longer available in `tika-parsers-standard-package`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
gastaldi commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2007927352 What error are you getting? If it is `cannot access org.apache.tika.parser.AbstractEncodingDetectorParser`, make sure you also add a dependency on `tika-core`.
sandeshkr419 commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2007939574

@gastaldi Thanks for the quick reply. These are the tika libraries that I'm currently consuming:

```
versions << [
  'tika'            : '2.6.0',
  'commonscompress' : '1.24.0'
. . .
api "org.apache.tika:tika-core:${versions.tika}"
api "org.apache.tika:tika-parsers:${versions.tika}"
api "org.apache.tika:tika-parsers-standard-package:${versions.tika}"
api "org.apache.tika:tika-langdetect-optimaize:${versions.tika}"
api "org.apache.commons:commons-compress:${versions.commonscompress}"
```

**With tika version:2.6.0, and commons-compress 1.24.0:** Everything worked fine.

**With tika version:2.6.0, and commons-compress 1.26.0:** `IWorkPackageParser`-related parsing started throwing exceptions:

```
org.opensearch.ingest.attachment.TikaDocTests > testFiles FAILED
    java.lang.RuntimeException: parsing of filename: testKeynote.key failed
        at __randomizedtesting.SeedInfo.seed([7E30995C8CE0CC1:6EFE6C139A13FF43]:0)
        at org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:85)
        at org.opensearch.ingest.attachment.TikaDocTests.testFiles(TikaDocTests.java:71)
    Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.iwork.IWorkPackageParser@3ba82e1d
        at app//org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
        at app//org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:195)
        at app//org.apache.tika.Tika.parseToString(Tika.java:525)
        at app//org.opensearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:122)
        at java.base@21.0.2/java.security.AccessController.doPrivileged(AccessController.java:714)
        at app//org.opensearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:121)
        at app//org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:80)
        ... 1 more
    Caused by: java.io.IOException: Resetting to invalid mark
        at java.base/java.io.BufferedInputStream.implReset(BufferedInputStream.java:583)
        at java.base/java.io.BufferedInputStream.reset(BufferedInputStream.java:569)
        at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:97)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        ... 7 more
```

**With tika version:2.8.0, and commons-compress 1.26.0:** The following dependencies fail to resolve:

```
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:94: error: package org.apache.tika.parser.html does not exist
    new org.apache.tika.parser.html.HtmlParser(),
    ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:95: error: package org.apache.tika.parser.pdf does not exist
    new org.apache.tika.parser.pdf.PDFParser(),
    ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:96: error: package org.apache.tika.parser.txt does not exist
    new org.apache.tika.parser.txt.TXTParser(),
    ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:97: error: package org.apache.tika.parser.microsoft.rtf does not exist
    new org.apache.tika.parser.microsoft.rtf.RTFParser(),
    ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:98: error: package org.apache.tika.parser.microsoft does not exist
    new org.apache.tika.parser.microsoft.OfficeParser(),
    ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:99: error: package org.apache.tika.parser.microsoft does not exist
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    ^
/Users/kusandes/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:100: error: package org.apache.tika.parser.microsoft.ooxml does not exist
    ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUD
```
gastaldi commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2009549387 No idea what can be causing that, perhaps @tballison might know
tballison commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010280292 Sorry, I haven't looked carefully at your gradle file...is it pulling in transitive dependencies, like `tika-parser-misc-office-module` for example?
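One way to answer the transitive-dependency question is to check at runtime which parser classes are actually loadable. The following is a minimal sketch using only JDK classes; `ParserClasspathCheck` and `missingClasses` are hypothetical names introduced for illustration, and the class names in `main` are taken from the compile errors quoted earlier in the thread. It inspects the runtime classpath only, so it may not reflect a compile classpath that is configured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika or OpenSearch): reports which of the
// given fully qualified class names cannot be loaded from the classpath.
class ParserClasspathCheck {

    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ParserClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Class names copied from the compile errors reported in this thread.
        List<String> parsers = List.of(
                "org.apache.tika.parser.html.HtmlParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.txt.TXTParser",
                "org.apache.tika.parser.microsoft.OfficeParser");
        System.out.println("Missing from classpath: " + missingClasses(parsers));
    }
}
```

If the list printed by `main` is non-empty when run with the plugin's dependencies, the jars that contain those parser classes are probably not being resolved.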
tballison commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010281749 I think the iworks and compress thing is fixed in 1.26.1. @THausherr does that sound right? The iworks issue rings a bell...
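For context on the `Resetting to invalid mark` message in the stack trace earlier in the thread: `java.io.BufferedInputStream.reset()` throws that `IOException` when the mark set by `mark(readlimit)` has been invalidated, which can happen when more data is read past the mark than the stream retains. The sketch below reproduces that general behavior with JDK classes only. It is not a reproduction of the Tika or commons-compress code path, and the thread does not establish that this is exactly what happens inside `IWorkPackageParser`. `MarkResetDemo` and `canResetAfterReading` are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows when BufferedInputStream.reset() fails after mark().
class MarkResetDemo {

    // Marks the stream at position 0, reads bytesToRead single bytes,
    // then attempts reset(). Returns true if reset() succeeded.
    static boolean canResetAfterReading(int bufferSize, int readLimit, int bytesToRead) {
        // In-memory source, so leaving the stream unclosed is harmless here.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[64]), bufferSize);
        in.mark(readLimit);
        try {
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            System.out.println("reset failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Reading fewer bytes than the read limit keeps the mark valid.
        System.out.println(canResetAfterReading(8, 4, 3));
        // Reading past both the read limit and the internal buffer drops the mark.
        System.out.println(canResetAfterReading(8, 4, 16));
    }
}
```

A parser that marks a stream, lets another component consume it, and then resets depends on that component not reading beyond the mark's limit, so a change in how much a dependency reads can surface as this exception.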
THausherr commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010370590 Yes; although I see that your last improvement wasn't added to 2.9.2, I'll do it. @gastaldi you can test with a snapshot https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.9.2-SNAPSHOT/
tballison commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010467082 Doh! Thank you, @THausherr . I'm happy to cherry-pick that bit as well.
sandeshkr419 commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010476782 Thanks @tballison and @THausherr for the quick help in addressing this. I am still blocked on consuming tika 2.8 and above. Would you happen to have any insight into other configuration changes that we need to accommodate? I have noted the errors I get when upgrading to 2.8.0 and above, where some classes are not resolved: https://github.com/apache/tika/pull/1130#issuecomment-2007939574
tballison commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010508279 https://github.com/apache/tika/pull/1130#issuecomment-2010280292
sandeshkr419 commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2010569496 Thanks @tballison @THausherr - I'm able to upgrade tika now. One last question: when do we expect 2.9.2 to be released?
THausherr commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2011621548 Some time next week, see the message in the mailing lists: "I'd like to fix TIKA-4211 before the next release. It has been a while since our last 2.x release. What do you think about aiming for starting the voting process early next week? Any other blockers?"
THausherr commented on PR #1130: URL: https://github.com/apache/tika/pull/1130#issuecomment-2016791966 Also observe the mass regression tests in https://issues.apache.org/jira/browse/TIKA-4171 . We hit several problems yesterday and these must be solved first.