Sorry, more correctly: OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0 requires Java 17 and our 3.x is still on Java 11.

On Mon, Apr 8, 2024 at 6:30 AM Tim Allison <talli...@apache.org> wrote:
>
> All,
> As Brian pointed out, optimaize is no longer maintained, and it has
> some dependencies that have aged out. Should we replace our baseline
> langdetect in tika-app and tika-server in 3.x?
> I'd say that we should go with our OpenNLP based language detection,
> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
> Java 17.
> Thoughts?
>
> Best,
>
> Tim
>
> ---------- Forwarded message ---------
> From: Brian Laskey <blas...@us.ibm.com>
> Date: Fri, Mar 8, 2024 at 2:38 PM
> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
> and parsers
> To: u...@tika.apache.org <u...@tika.apache.org>
>
> Hi Tim
>
> Thanks, this is helpful.
>
> For tika-app we found the dependency on org.apache.tika »
> tika-langdetect-optimaize brings in some older 3rd party jars, and
> unfortunately it appears that the com.optimaize.languagedetector »
> language-detector 0.6 is unmaintained so its dependencies on
> vulnerable versions of guava (18.0) cause us problems with security
> scans. I could be wrong but I don’t believe we need this component for
> our usage of just detect and parse?
>
> We have a sort of microservice process (java based) which is ingesting
> files parsed from tika. It was nice that we could separate the tika
> process in its own heap space as a separate java process rather than
> adding it to our app, but I suppose we could work around that
>
> Thank you
>
> Brian Laskey
>
> From: Tim Allison <talli...@apache.org>
> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
> Date: Friday, March 8, 2024 at 9:44 AM
> To: "u...@tika.apache.org" <u...@tika.apache.org>
> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
> tiki-core / and parsers
>
> Hi Brian,
>
> A few thoughts:
>
> 1) tika-app is basically tika-core + tika-parsers-standard-package.
> Which components are you trying to avoid? tika-serialization and
> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
> because we could factor out a tika-app-core with no parsers in Tika
> 3.x, which is what we do now with tika-server-core and
> tika-server-standard.
>
> 2) Unrelated, there are probably more efficient ways of running Tika
> than calling it per file on the commandline. That is a robust option,
> at least!
>
> If all you want is detect and text extraction, and you want to run it
> from the commandline, write two classes, whose main()s call:
>
> System.out.println(Tika.detect(File f));
>
> or
>
> System.out.println(Tika.parseToString(File f))
>
> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <blas...@us.ibm.com> wrote:
>
> Hello Tika community,
>
> Our team is migrating away from usage of tika-app.jar (2.6 currently)
> to something with more minimal third party dependencies which we can
> control.
>
> Is there any good documentation or pathway to describe how a team
> could map the tika-app functionality we use to the same behavior using
> just tika-core and tika-parsers-standard-package (I assume)?
>
> The tika-app functions we use today are:
>
> Mime-type detection
>
> java -jar tika-app.jar -d <file>
>
> and
>
> Text extraction attempts
>
> java -jar tika-app.jar -t <file>
>
> Is there a subset of tika parser jars we would need to include to have
> equivalent functionality if we wrote our own wrapper main class?
>
> Thank you,
>
> Brian Laskey