Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-19 Thread via GitHub


sandeshkr419 commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2007895625

   Hi @tballison & @gastaldi, 
   
   I was trying to upgrade `tika-parsers-standard-package` from 2.6 to 2.8/2.9 
because, after updating `commons-compress` to 1.25.0, I ran into issues parsing 
`IWorkPackageParser`-related files (.pages, .key). 
   
   Here are more details: 
https://github.com/opensearch-project/OpenSearch/pull/12627
   
   When I bumped the Tika dependencies to 2.8.0, 2.9.0, or 2.9.1, I could no 
longer use the parsers that used to be part of the 
`tika-parsers-standard-package` jar, such as `HtmlParser` and the others listed 
[here](https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java#L92).
   
   After the changes to the package structure of `tika-parsers-standard-package` 
since 2.7.0, has the way the dependencies are consumed changed? Is there a 
documented way to consume the parser implementations that are no longer 
available in `tika-parsers-standard-package`?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-19 Thread via GitHub


gastaldi commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2007927352

   What error are you getting? If it is 
`cannot access org.apache.tika.parser.AbstractEncodingDetectorParser`, make sure 
you also add a dependency on `tika-core`.





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-19 Thread via GitHub


sandeshkr419 commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2007939574

   @gastaldi Thanks for the quick reply.
   
   These are the present tika libraries that I'm consuming:
   
   ```
   versions << [
     'tika'            : '2.6.0',
     'commonscompress' : '1.24.0'
   ]
   .
   .
   .
   api "org.apache.tika:tika-core:${versions.tika}"
   api "org.apache.tika:tika-parsers:${versions.tika}"
   api "org.apache.tika:tika-parsers-standard-package:${versions.tika}"
   api "org.apache.tika:tika-langdetect-optimaize:${versions.tika}"
   
   api "org.apache.commons:commons-compress:${versions.commonscompress}"
   ```
   
   **With tika version:2.6.0, and commons-compress 1.24.0:**
   Everything worked fine.
   
   
   **With tika version:2.6.0, and commons-compress 1.26.0:**
   `IWorkPackageParser`-related parsing started throwing exceptions:
   
   ```
   org.opensearch.ingest.attachment.TikaDocTests > testFiles FAILED
   java.lang.RuntimeException: parsing of filename: testKeynote.key failed
       at __randomizedtesting.SeedInfo.seed([7E30995C8CE0CC1:6EFE6C139A13FF43]:0)
       at org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:85)
       at org.opensearch.ingest.attachment.TikaDocTests.testFiles(TikaDocTests.java:71)
   
   Caused by:
   org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.iwork.IWorkPackageParser@3ba82e1d
       at app//org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
       at app//org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:195)
       at app//org.apache.tika.Tika.parseToString(Tika.java:525)
       at app//org.opensearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:122)
       at java.base@21.0.2/java.security.AccessController.doPrivileged(AccessController.java:714)
       at app//org.opensearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:121)
       at app//org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:80)
       ... 1 more
   
   Caused by:
   java.io.IOException: Resetting to invalid mark
       at java.base/java.io.BufferedInputStream.implReset(BufferedInputStream.java:583)
       at java.base/java.io.BufferedInputStream.reset(BufferedInputStream.java:569)
       at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:97)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
       ... 7 more
   ```
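
For readers unfamiliar with the innermost exception: `Resetting to invalid mark` is the message `java.io.BufferedInputStream.reset()` gives when the mark set by `mark(readlimit)` has been discarded. The sketch below reproduces that message with JDK classes only. It is a generic illustration, not the Tika or commons-compress code path; the class name `MarkResetDemo`, the method `markReadReset`, and the tiny buffer sizes are assumptions chosen to make the effect easy to trigger.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; not part of Tika or OpenSearch.
public class MarkResetDemo {

    /**
     * Marks a buffered stream with the given read limit, reads bytesToRead
     * bytes one at a time, then tries to reset to the mark.
     * Returns "ok" if reset() succeeds, otherwise the IOException message.
     */
    public static String markReadReset(int bufferSize, int readLimit, int bytesToRead) {
        // 64 zero bytes stand in for real file content.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[64]), bufferSize);
        try {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Reading within the buffer keeps the mark valid.
        System.out.println(markReadReset(4, 2, 2));
        // Reading past the buffer with a small read limit drops the mark.
        System.out.println(markReadReset(4, 2, 5));
    }
}
```

With a 4-byte buffer and a read limit of 2, reading 2 bytes leaves the mark intact, while reading 5 forces a buffer refill that drops it. Whether something similar happens inside `IWorkPackageParser` with the newer commons-compress is not established in this thread.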
   
   
   **With tika version:2.8.0, and commons-compress 1.26.0:**
   
   The following dependencies fail to resolve:
   ```
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:94: error: package org.apache.tika.parser.html does not exist
       new org.apache.tika.parser.html.HtmlParser(),
                                      ^
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:95: error: package org.apache.tika.parser.pdf does not exist
       new org.apache.tika.parser.pdf.PDFParser(),
                                     ^
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:96: error: package org.apache.tika.parser.txt does not exist
       new org.apache.tika.parser.txt.TXTParser(),
                                     ^
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:97: error: package org.apache.tika.parser.microsoft.rtf does not exist
       new org.apache.tika.parser.microsoft.rtf.RTFParser(),
                                               ^
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:98: error: package org.apache.tika.parser.microsoft does not exist
       new org.apache.tika.parser.microsoft.OfficeParser(),
                                           ^
   /Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:99: error: package org.apache.tika.parser.microsoft does not exist
       new org.apache.tika.parser.microsoft.OldExcelParser(),
                                           ^
   /Users/kusandes/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:100: error: package org.apache.tika.parser.microsoft.ooxml does not exist
       ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUD
   ```

Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


gastaldi commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2009549387

   No idea what can be causing that, perhaps @tballison might know 





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


tballison commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010280292

   Sorry, I haven't looked carefully at your Gradle file. Is it pulling in 
transitive dependencies, like `tika-parser-miscoffice-module` for example?
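
One way to check the transitive-dependency question at runtime is to test whether the parser classes named in the compile errors are visible on the classpath. The following is a minimal sketch using only the JDK; the class `ParserClasspathCheck` and its methods are hypothetical helpers, and the class names are copied from the error output earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or OpenSearch.
public class ParserClasspathCheck {

    /** Returns true if the named class can be loaded, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers the case where the class exists but one of
            // its own dependencies is missing.
            return false;
        }
    }

    /** Returns the subset of classNames that cannot be loaded, in input order. */
    public static List<String> missing(List<String> classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Parser classes taken from the compile errors reported in this thread.
        List<String> parsers = List.of(
                "org.apache.tika.parser.html.HtmlParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.txt.TXTParser",
                "org.apache.tika.parser.microsoft.rtf.RTFParser",
                "org.apache.tika.parser.microsoft.OfficeParser",
                "org.apache.tika.parser.microsoft.OldExcelParser",
                "org.apache.tika.parser.microsoft.ooxml.OOXMLParser");
        for (String name : missing(parsers)) {
            System.out.println("missing: " + name);
        }
    }
}
```

Running this inside the plugin's runtime classpath would list any parser class that the build did not pull in. An empty result would suggest the modules are present at runtime and the problem is limited to the compile classpath.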





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


tballison commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010281749

   I think the iWork and commons-compress issue is fixed in 1.26.1. @THausherr, 
does that sound right? The iWork issue rings a bell...





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


THausherr commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010370590

   Yes; although I see that your last improvement wasn't added to 2.9.2, I'll 
do it.
   
   @gastaldi you can test with a snapshot
   
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.9.2-SNAPSHOT/





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


tballison commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010467082

   Doh! Thank you, @THausherr . I'm happy to cherry-pick that bit as well.





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


sandeshkr419 commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010476782

   Thanks @tballison and @THausherr for the quick help on addressing this. 
   
   I still have a blocker in consuming Tika 2.8 and above.
   Would you happen to have any insight into other configuration changes 
that we need to accommodate?
   I noted the errors from upgrading to 2.8.0 and above, where some classes 
are not resolved, here: 
https://github.com/apache/tika/pull/1130#issuecomment-2007939574 





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


tballison commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010508279

   https://github.com/apache/tika/pull/1130#issuecomment-2010280292





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-20 Thread via GitHub


sandeshkr419 commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2010569496

   Thanks @tballison @THausherr - I'm able to upgrade Tika now. 
   Last question: when is 2.9.2 expected to be released?





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-21 Thread via GitHub


THausherr commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2011621548

   Some time next week, see the message in the mailing lists: "I'd like to fix 
TIKA-4211 before the next release. It has been a while since our last 2.x 
release. What do you think about aiming for starting the voting process early 
next week? Any other blockers?"





Re: [PR] TIKA-4038: Remove shading of `tika-parsers-standard-package` [tika]

2024-03-24 Thread via GitHub


THausherr commented on PR #1130:
URL: https://github.com/apache/tika/pull/1130#issuecomment-2016791966

   Also observe the mass regression tests in 
https://issues.apache.org/jira/browse/TIKA-4171 . We hit several problems 
yesterday and these must be solved first.

