Sorry, more correctly:

OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
requires Java 17 and our 3.x is still on 11.

On Mon, Apr 8, 2024 at 6:30 AM Tim Allison <talli...@apache.org> wrote:
>
> All,
>   As Brian pointed out, optimaize is no longer maintained, and it has
> some dependencies that have aged out. Should we replace our baseline
> langdetect in tika-app and tika-server in 3.x?
>   I'd say that we should go with our OpenNLP based language detection,
> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
> Java 17.
>   Thoughts?
>
>             Best,
>
>                 Tim
>
> ---------- Forwarded message ---------
> From: Brian Laskey <blas...@us.ibm.com>
> Date: Fri, Mar 8, 2024 at 2:38 PM
> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
> and parsers
> To: u...@tika.apache.org <u...@tika.apache.org>
>
>
> Hi Tim
>
>
>
> Thanks, this is helpful.
>
>
>
> For tika-app we found the dependency on org.apache.tika »
> tika-langdetect-optimaize brings in some older 3rd party jars, and
> unfortunately it appears that the com.optimaize.languagedetector »
> language-detector 0.6 is unmaintained, so its dependencies on
> vulnerable versions of guava (18.0) cause us problems with security
> scans. I could be wrong, but I don’t believe we need this component for
> our usage of just detect and parse?
>
>
>
> We have a sort of microservice process (java based) which is ingesting
> files parsed by tika. It was nice that we could separate the tika
> process in its own heap space as a separate java process rather than
> adding it to our app, but I suppose we could work around that.
>
>
>
> Thank you
>
> Brian Laskey
>
>
>
> From: Tim Allison <talli...@apache.org>
> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
> Date: Friday, March 8, 2024 at 9:44 AM
> To: "u...@tika.apache.org" <u...@tika.apache.org>
> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
> tiki-core / and parsers
>
>
>
> Hi Brian,
>
>   A few thoughts:
>
>
>
> 1) tika-app is basically tika-core + tika-parsers-standard-package.
> Which components are you trying to avoid? tika-serialization and
> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
> because we could factor out a tika-app-core with no parsers in Tika
> 3.x, which is what we do now with tika-server-core and
> tika-server-standard.
>
>
>
> 2) Unrelated, there are probably more efficient ways of running Tika
> than calling it per file on the commandline. That is a robust option,
> at least!
>
>
>
> If all you want is detect and text extraction, and you want to run it
> from the commandline, write two classes, whose main()s call:
>
> System.out.println(new Tika().detect(new File(args[0])));
>
>
>
> or
>
>
>
> System.out.println(new Tika().parseToString(new File(args[0])));
>
>
>
> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <blas...@us.ibm.com> wrote:
>
> Hello Tika community,
>
>
>
> Our team is migrating away from usage of tika-app.jar (2.6 currently)
> to something with more minimal third party dependencies which we can
> control.
>
>
>
> Is there any good documentation or pathway to describe how a team
> could map the tika-app functionality we use to the same behavior using
> just tika-core and tika-parsers-standard-package (I assume)?
>
>
>
> The tika-app functions we use today are:
>
>
>
> Mime-type detection
>
> java -jar tika-app.jar -d <file>
>
>
>
> and
>
> Text extraction attempts
>
> java -jar tika-app.jar -t <file>
>
>
>
> Is there a subset of tika parser jars we would need to include to have
> equivalent functionality if we wrote our own wrapper main class?
>
>
>
> Thank you,
>
> Brian Laskey
