Time to move on from Java 11? Lucene 10 will be on 17+, Solr 10 will be on 17+,
and OpenNLP is already there. Java 11 is EOL and has been for a while.

Do we know of any other file parsers that are being optimized to take
advantage of features in recent Java versions?

> On Apr 8, 2024, at 7:02 AM, Tim Allison <talli...@apache.org> wrote:
> 
> Sorry, more correctly:
> 
> OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
> requires Java 17 and our 3.x is still on 11.
> 
> On Mon, Apr 8, 2024 at 6:30 AM Tim Allison <talli...@apache.org> wrote:
>> 
>> All,
>>  As Brian pointed out, optimaize is no longer maintained, and it has
>> some dependencies that have aged out. Should we replace our baseline
>> langdetect in tika-app and tika-server in 3.x?
>>  I'd say that we should go with our OpenNLP based language detection,
>> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
>> Java 17.
>>  Thoughts?
>> 
>>            Best,
>> 
>>                Tim
>> 
>> ---------- Forwarded message ---------
>> From: Brian Laskey <blas...@us.ibm.com>
>> Date: Fri, Mar 8, 2024 at 2:38 PM
>> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
>> and parsers
>> To: u...@tika.apache.org <u...@tika.apache.org>
>> 
>> 
>> Hi Tim
>> 
>> 
>> 
>> Thanks, this is helpful.
>> 
>> 
>> 
>> For tika-app we found that the dependency on org.apache.tika »
>> tika-langdetect-optimaize brings in some older 3rd party jars.
>> Unfortunately it appears that com.optimaize.languagedetector »
>> language-detector 0.6 is unmaintained, so its dependencies on
>> vulnerable versions of guava (18.0) cause us problems with security
>> scans. I could be wrong, but I don’t believe we need this component for
>> our usage of just detect and parse?
>> 
>> 
>> 
>> We have a sort of microservice process (Java-based) which ingests
>> files parsed by Tika. It was nice that we could keep the Tika
>> process in its own heap space as a separate Java process rather than
>> adding it to our app, but I suppose we could work around that.
>> 
>> 
>> 
>> Thank you
>> 
>> Brian Laskey
>> 
>> 
>> 
>> From: Tim Allison <talli...@apache.org>
>> Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Date: Friday, March 8, 2024 at 9:44 AM
>> To: "u...@tika.apache.org" <u...@tika.apache.org>
>> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
>> tiki-core / and parsers
>> 
>> 
>> 
>> Hi Brian,
>> 
>>  A few thoughts:
>> 
>> 
>> 
>> 1) tika-app is basically tika-core + tika-parsers-standard-package.
>> Which components are you trying to avoid? tika-serialization and
>> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
>> because we could factor out a tika-app-core with no parsers in Tika
>> 3.x, which is what we do now with tika-server-core and
>> tika-server-standard.
>> 
>> 
>> 
>> 2) Unrelated, there are probably more efficient ways of running Tika
>> than calling it per file on the commandline. That is a robust option,
>> at least!
>> 
>> 
>> 
>> If all you want is detect and text extraction, and you want to run it
>> from the commandline, write two classes, whose main()s call:
>> 
>> System.out.println(new Tika().detect(new File(args[0])));
>> 
>> 
>> 
>> or
>> 
>> 
>> 
>> System.out.println(new Tika().parseToString(new File(args[0])));
>> 
>> 
>> 
>> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <blas...@us.ibm.com> wrote:
>> 
>> Hello Tika community,
>> 
>> 
>> 
>> Our team is migrating away from usage of tika-app.jar (2.6 currently)
>> to something with more minimal third party dependencies which we can
>> control.
>> 
>> 
>> 
>> Is there any good documentation or pathway describing how a team
>> could map the tika-app functionality we use to the same behavior using
>> just tika-core and tika-parsers-standard-package (I assume)?
>> 
>> 
>> 
>> The tika-app functions we use today are:
>> 
>> 
>> 
>> Mime-type detection
>> 
>> java -jar tika-app.jar -d <file>
>> 
>> 
>> 
>> and
>> 
>> Text extraction attempts
>> 
>> java -jar tika-app.jar -t <file>
>> 
>> 
>> 
>> Is there a subset of tika parser jars we would need to include to have
>> equivalent functionality if we wrote our own wrapper main class?
>> 
>> 
>> 
>> Thank you,
>> 
>> Brian Laskey
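
For the wrapper main class Brian asks about in the quoted thread, the command-line
side could look roughly like the sketch below. It only shows how the `-d <file>`
and `-t <file>` invocations might be dispatched. `MiniTikaCli`, `Backend` and
`JdkBackend` are hypothetical names that do not come from Tika, and `JdkBackend`
is a JDK-only placeholder so the sketch runs without any Tika jars: it does not
reproduce Tika's detection or parsing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MiniTikaCli {

    /** Hypothetical seam between the CLI and whatever does the real work. */
    interface Backend {
        String detect(Path file) throws IOException;
        String extractText(Path file) throws IOException;
    }

    /** JDK-only placeholder so the sketch runs without Tika; NOT equivalent to Tika. */
    static final class JdkBackend implements Backend {
        @Override
        public String detect(Path file) throws IOException {
            // probeContentType may return null when the type is unknown.
            String type = Files.probeContentType(file);
            return type != null ? type : "application/octet-stream";
        }

        @Override
        public String extractText(Path file) throws IOException {
            // Only meaningful for plain-text files; a real backend would parse the format.
            return Files.readString(file);
        }
    }

    /** Maps "-d <file>" and "-t <file>" onto the backend and returns the output. */
    static String run(Backend backend, String[] args) throws IOException {
        if (args.length != 2) {
            throw new IllegalArgumentException("usage: (-d | -t) <file>");
        }
        Path file = Path.of(args[1]);
        switch (args[0]) {
            case "-d":
                return backend.detect(file);
            case "-t":
                return backend.extractText(file);
            default:
                throw new IllegalArgumentException("unknown option: " + args[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run(new JdkBackend(), args));
    }
}
```

A Tika-backed `Backend` would presumably wrap the two facade calls from Tim's
message, `detect(File)` and `parseToString(File)` on an `org.apache.tika.Tika`
instance. Which parser jars that needs on the classpath is the open question in
the thread.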

_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
