Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, and OpenNLP is already there. Java 11 is EOL and has been for a while.
Do we know of any other file parsers that are being optimized to take advantage of features in recent Java versions?

> On Apr 8, 2024, at 7:02 AM, Tim Allison <talli...@apache.org> wrote:
>
> Sorry, more correctly:
>
> OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
> requires Java 17 and our 3.x is still on 11.
>
> On Mon, Apr 8, 2024 at 6:30 AM Tim Allison <talli...@apache.org> wrote:
>>
>> All,
>> As Brian pointed out, optimaize is no longer maintained, and it has
>> some dependencies that have aged out. Should we replace our baseline
>> langdetect in tika-app and tika-server in 3.x?
>> I'd say that we should go with our OpenNLP based language detection,
>> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
>> Java 17.
>> Thoughts?
>>
>> Best,
>>
>> Tim
>>
>> ---------- Forwarded message ---------
>> From: Brian Laskey <blas...@us.ibm.com>
>> Date: Fri, Mar 8, 2024 at 2:38 PM
>> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
>> and parsers
>> To: u...@tika.apache.org <u...@tika.apache.org>
>>
>> Hi Tim
>>
>> Thanks this is helpful.
>>
>> For tika-app we found the dependency on org.apache.tika »
>> tika-langdetect-optimaize brings in some older 3rd party jars, and
>> unfortunately it appears that the com.optimaize.languagedetector »
>> language-detector 0.6 is unmaintained so its dependencies on
>> vulnerable versions of guava (18.0) cause us problems with security
>> scans. I could be wrong but I don’t believe we need this component for
>> our usage of just detect and parse?
>>
>> We have a sort of microservice process (java based) which is ingesting
>> files parsed from tika. It was nice that we could separate the tika
>> process in its own heap space as a separate java process rather than
>> adding it to our app, but I suppose we could work around that.
>>
>> Thank you
>>
>> Brian Laskey
>>
>> From: Tim Allison <talli...@apache.org>
>> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Date: Friday, March 8, 2024 at 9:44 AM
>> To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
>> tiki-core / and parsers
>>
>> Hi Brian,
>>
>> A few thoughts:
>>
>> 1) tika-app is basically tika-core + tika-parsers-standard-package.
>> Which components are you trying to avoid? tika-serialization and
>> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
>> because we could factor out a tika-app-core with no parsers in Tika
>> 3.x, which is what we do now with tika-server-core and
>> tika-server-standard.
>>
>> 2) Unrelated, there are probably more efficient ways of running Tika
>> than calling it per file on the commandline. That is a robust option,
>> at least!
>>
>> If all you want is detect and text extraction, and you want to run it
>> from the commandline, write two classes, whose main()s call:
>>
>>     System.out.println(new Tika().detect(new File(args[0])));
>>
>> or
>>
>>     System.out.println(new Tika().parseToString(new File(args[0])));
>>
>> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <blas...@us.ibm.com> wrote:
>>
>> Hello Tika community,
>>
>> Our team is migrating away from usage of tika-app.jar (2.6 currently)
>> to something with more minimal third party dependencies which we can
>> control.
>>
>> Is there any good documentation or pathway to describe how a team
>> could map the tika-app functionality we use to the same behavior using
>> just tika-core and tika-parsers-standard-package (I assume)?
>>
>> The tika-app functions we use today are:
>>
>> Mime-type detection
>>
>>     java -jar tika-app.jar -d <file>
>>
>> and
>>
>> Text extraction attempts
>>
>>     java -jar tika-app.jar -t <file>
>>
>> Is there a subset of tika parser jars we would need to include to have
>> equivalent functionality if we wrote our own wrapper main class?
>>
>> Thank you,
>>
>> Brian Laskey

_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>