Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
From October 2023:
https://www.brilworks.com/blog/java-11-countdown-to-end-of-support/

Getting 3.x out has taken longer than I had anticipated. Should we
reopen the Java 17 vs. 11 discussion given Eric's input? Or do we
continue with the plan to target Java 11 in 3.x for the foreseeable
future?

On Mon, Apr 8, 2024 at 9:22 AM Eric Pugh wrote:
>
> Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, and
> OpenNLP is already there. Java 11 is EOL and has been for a while.
>
> Do we know of any other file parsers that are being optimized to take
> advantage of the newer features in recent Java versions?
>
> > On Apr 8, 2024, at 7:02 AM, Tim Allison wrote:
> >
> > Sorry, more correctly:
> >
> > OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
> > requires Java 17 and our 3.x is still on 11.
> >
> > On Mon, Apr 8, 2024 at 6:30 AM Tim Allison wrote:
> >>
> >> All,
> >>  As Brian pointed out, optimaize is no longer maintained, and it has
> >> some dependencies that have aged out. Should we replace our baseline
> >> langdetect in tika-app and tika-server in 3.x?
> >>  I'd say that we should go with our OpenNLP based language detection,
> >> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
> >> Java 17.
> >>  Thoughts?
> >>
> >> Best,
> >>
> >> Tim
> >>
> >> ---------- Forwarded message ---------
> >> From: Brian Laskey 
> >> Date: Fri, Mar 8, 2024 at 2:38 PM
> >> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
> >> and parsers
> >> To: u...@tika.apache.org 
> >>
> >>
> >> Hi Tim
> >>
> >>
> >>
> >> Thanks, this is helpful.
> >>
> >>
> >>
> >> For tika-app we found that the dependency on org.apache.tika »
> >> tika-langdetect-optimaize brings in some older third-party jars.
> >> Unfortunately, com.optimaize.languagedetector » language-detector 0.6
> >> appears to be unmaintained, so its dependencies on vulnerable versions
> >> of guava (18.0) cause us problems with security scans. I could be
> >> wrong, but I don’t believe we need this component for our usage of
> >> just detect and parse?
> >>
> >>
> >>
> >> We have a sort of microservice process (Java-based) which ingests
> >> files parsed by Tika. It was nice that we could keep the Tika process
> >> in its own heap space as a separate Java process rather than adding
> >> it to our app, but I suppose we could work around that.
> >>
> >>
> >> Thank you
> >>
> >> Brian Laskey
> >>
> >>
> >>
> >> From: Tim Allison 
> >> Reply-To: "u...@tika.apache.org" 
> >> Date: Friday, March 8, 2024 at 9:44 AM
> >> To: "u...@tika.apache.org" 
> >> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
> >> tiki-core / and parsers
> >>
> >>
> >>
> >> Hi Brian,
> >>
> >>  A few thoughts:
> >>
> >>
> >>
> >> 1) tika-app is basically tika-core + tika-parsers-standard-package.
> >> Which components are you trying to avoid? tika-serialization and
> >> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
> >> because we could factor out a tika-app-core with no parsers in Tika
> >> 3.x, which is what we do now with tika-server-core and
> >> tika-server-standard.
> >>
> >>
> >>
> >> 2) Unrelated, there are probably more efficient ways of running Tika
> >> than calling it per file on the commandline. That is a robust option,
> >> at least!
> >>
> >>
> >>
> >> If all you want is detection and text extraction, and you want to run
> >> it from the commandline, write two classes whose main()s call the
> >> following, where tika is an org.apache.tika.Tika instance and f is a
> >> java.io.File:
> >>
> >> System.out.println(tika.detect(f));
> >>
> >> or
> >>
> >> System.out.println(tika.parseToString(f));
> >>
> >>
> >>
> >> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey wrote:
> >>
> >> Hello Tika community,
> >>
> >>
> >>
> >> Our team is migrating away from usage of tika-app.jar (2.6 currently)
> >> to something with more minimal third party dependencies which we can
> >> control.
> >>
> >>
> >>
> >> Is there any good documentation or pathway to describe how a team
> >> could map the tika-app functionality we use to the same behavior using
> >> just tika-core and tika-parsers-standard-package (I assume)?
> >>
> >>
> >>
> >> The tika-app functions we use today are:
> >>
> >>
> >>
> >> Mime-type detection
> >>
> >> java -jar tika-app.jar -d 
> >>
> >>
> >>
> >> and
> >>
> >> Text extraction attempts
> >>
> >> java -jar tika-app.jar -t 
> >>
> >>
> >>
> >> Is there a subset of tika parser jars we would need to include to have
> >> equivalent functionality if we wrote our own wrapper main class?
> >>
> >>
> >>
> >> Thank you,
> >>
> >> Brian Laskey
>
> ___
> Eric Pugh | Founder | OpenSource Connections, LLC | 
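
For readers following the "separate Java process" approach Brian describes, here is a minimal sketch of a wrapper that launches tika-app.jar per file with the -d and -t options mentioned in the thread. The class name TikaAppRunner and its two helper methods are hypothetical names chosen for illustration, and the sketch assumes a `java` executable on the PATH and a local copy of tika-app.jar. It uses only the JDK's ProcessBuilder and is not taken from Tika itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical wrapper that runs tika-app.jar in its own JVM (and heap). */
public class TikaAppRunner {

    /** Builds the command line; mode is "-d" (detect) or "-t" (text), as in the thread. */
    static List<String> buildCommand(String jarPath, String mode, String file) {
        if (!mode.equals("-d") && !mode.equals("-t")) {
            throw new IllegalArgumentException("unsupported mode: " + mode);
        }
        return List.of("java", "-jar", jarPath, mode, file);
    }

    /** Runs the command and returns what the child process wrote to stdout. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass stderr through
                .start();
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to tika-app.jar (assumed), args[1]: file to inspect
        System.out.print(run(buildCommand(args[0], "-d", args[1])));
        System.out.print(run(buildCommand(args[0], "-t", args[1])));
    }
}
```

Each call pays JVM startup cost, which is the trade-off Tim alludes to: process isolation is robust, but calling the Tika facade in-process or using tika-server is likely to be more efficient for many files.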

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Eric Pugh
Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, and
OpenNLP is already there. Java 11 is EOL and has been for a while.

Do we know of any other file parsers that are being optimized to take
advantage of the newer features in recent Java versions?


___
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed





Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
Sorry, more correctly:

OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
requires Java 17 and our 3.x is still on 11.
