date:20240408

Tika 3.0.0-BETA2?

2024-04-08 Thread Tim Allison

All,
  I'm now thinking it would make sense to have one more 3.x beta
release before the final 3.0.0. Are there any breaking changes that we
want to get into 3.x?
  I'd like to wait for COMPRESS-675 to be fixed and for COMPRESS-674
to be released before we release 3.0.0-BETA2. Any other items that we
want to wait for?
  What do you think?
  Thank you.

Best,

   Tim

Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison

All,
  As Brian pointed out, optimaize is no longer maintained, and it has
some dependencies that have aged out. Should we replace our baseline
langdetect in tika-app and tika-server in 3.x?
  I'd say that we should go with our OpenNLP based language detection,
but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
Java 17.
  Thoughts?

Best,

Tim

-- Forwarded message -
From: Brian Laskey 
Date: Fri, Mar 8, 2024 at 2:38 PM
Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
and parsers
To: u...@tika.apache.org 


Hi Tim



Thanks this is helpful.



For tika-app we found the dependency on org.apache.tika »
tika-langdetect-optimaize brings in some older 3rd party jars, and
unfortunately it appears that the com.optimaize.languagedetector »
language-detector 0.6 is unmaintained so its dependencies on
vulnerable versions of guava (18.0) cause us problems with security
scans. I could be wrong but I don’t believe we need this component for
our usage of just detect and parse?



We have a sort of microservice process (java based) which is ingesting
files parsed from tika. It was nice that we could separate the tika
process in its own heap space as a separate java process rather than
adding it to our app, but I suppose we could work around that.



Thank you

Brian Laskey



From: Tim Allison 
Reply-To: "u...@tika.apache.org" 
Date: Friday, March 8, 2024 at 9:44 AM
To: "u...@tika.apache.org" 
Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
tiki-core / and parsers



Hi Brian,

  A few thoughts:



1) tika-app is basically tika-core + tika-parsers-standard-package.
Which components are you trying to avoid? tika-serialization and
jackson? boilerpipecontenthandler and some of its dependencies? I ask,
because we could factor out a tika-app-core with no parsers in Tika
3.x, which is what we do now with tika-server-core and
tika-server-standard.



2) Unrelated, there are probably more efficient ways of running Tika
than calling it per file on the commandline. That is a robust option,
at least!



If all you want is detect and text extraction, and you want to run it
from the commandline, write two classes, whose main()s call:

System.out.println(new Tika().detect(f)); // f is a java.io.File



or



System.out.println(new Tika().parseToString(f)); // f is a java.io.File



On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey  wrote:

Hello Tika community,



Our team is migrating away from usage of tika-app.jar (2.6 currently)
to something with more minimal third party dependencies which we can
control.



Is there any good documentation or pathway to describe how a team
could map the tika-app functionality we use to the same behavior using
just tika-core and tika-parsers-standard-package

(I assume)?



The tika-app functions we use today are:



Mime-type detection

java -jar tika-app.jar -d 



and

Text extraction attempts

java -jar tika-app.jar -t 
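
For reference, the two invocations above could be driven from a parent Java process, which keeps Tika in its own JVM and heap. The following is only a minimal sketch: the class name `TikaAppRunner`, the jar location and the `run` helper are illustrative assumptions and not part of Tika; only the `-d` and `-t` flags are taken from the commands above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Sketch: run tika-app in a child JVM so parsing gets its own heap. */
final class TikaAppRunner {

    // Location of tika-app.jar is an assumption; point it at your actual jar.
    private final Path tikaAppJar;

    TikaAppRunner(Path tikaAppJar) {
        this.tikaAppJar = tikaAppJar;
    }

    /** Command line for mime-type detection (the -d flag). */
    List<String> detectCommand(Path file) {
        return List.of("java", "-jar", tikaAppJar.toString(), "-d", file.toString());
    }

    /** Command line for plain-text extraction (the -t flag). */
    List<String> textCommand(Path file) {
        return List.of("java", "-jar", tikaAppJar.toString(), "-t", file.toString());
    }

    /** Starts the child process and returns everything it wrote to stdout. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass stderr through
                .start();
        String stdout = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("child process exited with code " + exitCode);
        }
        return stdout;
    }
}
```

A caller would do something like `TikaAppRunner.run(runner.detectCommand(file))`. Starting one JVM per file is robust but slow, so this is better suited to low volumes.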



Is there a subset of tika parser jars we would need to include to have
equivalent functionality if we wrote our own wrapper main class?



Thank you,

Brian Laskey

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison

Sorry, more correctly:

OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
requires Java 17 and our 3.x is still on 11.


Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Eric Pugh

Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, and OpenNLP
is already there. Java 11 is EOL and has been for a while.

Any other file parsers that are being optimized to take advantage of the newer 
features that are in recent Java versions that we know about?


Re: [PR] Tika 4237 add jwt authentication ability to the http fetcher [tika]

2024-04-08 Thread via GitHub



bartek commented on code in PR #1712:
URL: https://github.com/apache/tika/pull/1712#discussion_r1555919713


##
tika-pipes/tika-fetchers/tika-fetcher-http/src/main/java/org/apache/tika/pipes/fetcher/http/jwt/JwtGenerator.java:
##
@@ -0,0 +1,64 @@
+package org.apache.tika.pipes.fetcher.http.jwt;
+
+import java.time.Instant;
+import java.time.temporal.ChronoUnit;
+import java.util.Date;
+
+import com.nimbusds.jose.JOSEException;
+import com.nimbusds.jose.JWSAlgorithm;
+import com.nimbusds.jose.JWSHeader;
+import com.nimbusds.jose.JWSSigner;
+import com.nimbusds.jose.crypto.MACSigner;
+import com.nimbusds.jose.crypto.RSASSASigner;
+import com.nimbusds.jwt.JWTClaimsSet;
+import com.nimbusds.jwt.SignedJWT;
+
+public class JwtGenerator {

Review Comment:
   One thing we need to support is the `qsh` claim addition for Atlassian JWTs, 
see: 
https://developer.atlassian.com/cloud/jira/platform/understanding-jwt-for-connect-apps/
   
   This `qsh` claim takes the incoming request URL (i.e. the download URL) and 
computes a hash from it. My thinking is we could send the calculated value of 
the `qsh` claim as part of FetchAndParse, and it would be added to the claims 
during token generation. Every request would of course need a new JWT, as this 
`qsh` claim value would be different.
   
   Is that possible in this workflow?
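
   To make the per-request nature of `qsh` concrete, here is a simplified sketch of how such a hash could be computed. It assumes the canonical request has the form `METHOD&path&sorted-query` and is hashed with SHA-256 and hex-encoded, as described in the Atlassian page linked above. The class `QshSketch` is hypothetical and not part of the PR, and percent-encoding and repeated parameters are deliberately not handled, so check the exact rules against the Atlassian documentation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical helper: simplified query string hash (qsh) computation. */
final class QshSketch {

    /** Builds METHOD&path&key=value pairs sorted by key, skipping any "jwt" parameter. */
    static String canonicalRequest(String method, String path, Map<String, String> query) {
        StringJoiner sortedQuery = new StringJoiner("&");
        for (Map.Entry<String, String> e : new TreeMap<>(query).entrySet()) {
            if (!"jwt".equals(e.getKey())) {
                // Assumption: keys and values are already percent-encoded.
                sortedQuery.add(e.getKey() + "=" + e.getValue());
            }
        }
        String uri = path.isEmpty() ? "/" : path;
        return method.toUpperCase(Locale.ROOT) + "&" + uri + "&" + sortedQuery;
    }

    /** SHA-256 of the canonical request, as lowercase hex. */
    static String qsh(String method, String path, Map<String, String> query) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                    canonicalRequest(method, path, query).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

   Because the hash depends on the method, path and query of each download URL, a token generated for one fetch cannot be reused for another, which is why the claim value would need to travel with each FetchAndParse request.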



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison

From October 2023:
https://www.brilworks.com/blog/java-11-countdown-to-end-of-support/

Getting 3.x out has taken longer than I had anticipated. Should we
reopen the 17 vs 11 discussion given Eric's input? Or do we continue
with the plan to target 11 in 3.x for the foreseeable future?


Document chunking

2024-04-08 Thread Tim Allison

Not sure we should jump on the bandwagon, but anything we can do to support
smart chunking would benefit us.

Could just be more integrations with parsers that turn out to be useful. I
haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
https://github.com/Filimoa/open-parse

Re: Document chunking

2024-04-08 Thread Michael Wechner

It would be great to have good "semantic chunking" in order to generate 
vector embeddings.


Thanks for the link below, will try to test it.

Thanks

Michael



On 08.04.24 at 18:29, Tim Allison wrote:

> Not sure we should jump on the bandwagon, but anything we can do to support
> smart chunking would benefit us.
>
> Could just be more integrations with parsers that turn out to be useful. I
> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> https://github.com/Filimoa/open-parse

Re: Document chunking

2024-04-08 Thread Nicholas DiPiazza

I am also very interested in this vector-based search. Indexes are a big
thing right now.


Re: [PR] Support for adding custom tika configuration [tika-helm]

2024-04-08 Thread via GitHub



lewismc merged PR #15:
URL: https://github.com/apache/tika-helm/pull/15



Re: [PR] Support for adding custom tika configuration [tika-helm]

2024-04-08 Thread via GitHub



lewismc commented on PR #15:
URL: https://github.com/apache/tika-helm/pull/15#issuecomment-2043768368

   Thank you @ahilmathew really nice patch. 



[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-08 Thread Lewis John McGibbney (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835077#comment-17835077
 ] 

Lewis John McGibbney commented on TIKA-4232:


It turns out that the original GitHub action I wanted to use will not be 
approved for use. 

I’m therefore investigating running the tests via the 
[https://github.com/marketplace/actions/docker-run-action] with the 
{{helmunittest/helm-unittest}} Docker image to generate the JUnit report, 
and then using the [https://github.com/marketplace/actions/junit-report-action] 
to report the test results on the PR. 

I’ll investigate further and follow up here. 

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Re: Document chunking

2024-04-08 Thread Nick Burch


On Mon, 8 Apr 2024, Tim Allison wrote:
> Not sure we should jump on the bandwagon, but anything we can do to
> support smart chunking would benefit us.
>
> Could just be more integrations with parsers that turn out to be useful. I
> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> https://github.com/Filimoa/open-parse


I played around with chunking a bit late last year, but owing to not 
getting any of the AI jobs I went for, I didn't get it beyond a rough 
prototype. I can say that most people are doing a terrible job in their 
out-of-the-box configs...


My current suggested (but not fully tested) approach is:
 * Define a range of chunk sizes that you'd like (min / ideal / max)
 * Parse as XHTML with Tika
 * Keep track of headings and table headers
 * Break on headings
 * If a chunk is too big, break on other elements (eg div or p)
 * If a chunk is too small, and near other small chunks, join them
 * Include 1-2 headings above the current one at the top,
   as a targeted bit of Table of Contents. (eg chunk starts on H3, put
   the H2 in as well)
 * If you broke up a huge table, repeat the table headers at the
   start of every chunk
 * When you're done chunking + adding bits back at the top, convert
   to markdown on output
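
A few of the steps above (break on headings, break oversized chunks on other elements, join small neighbours) can be sketched as follows. This is only an illustration under stated assumptions: the `Block` type stands in for block-level elements already pulled out of Tika's XHTML output, sizes are counted in characters, and the parent-heading context, table-header repetition and markdown conversion are left out. None of the names here are Tika APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of heading-based chunking with min/max size limits. */
final class ChunkSketch {

    /** One block-level element; headingLevel is 0 for non-headings (p, div, ...). */
    static final class Block {
        final int headingLevel;
        final String text;

        Block(int headingLevel, String text) {
            this.headingLevel = headingLevel;
            this.text = text;
        }
    }

    static List<String> chunk(List<Block> blocks, int min, int max) {
        // Pass 1: start a new chunk on every heading, and on an element
        // boundary whenever adding the next block would exceed max.
        List<StringBuilder> raw = new ArrayList<>();
        StringBuilder current = null;
        for (Block b : blocks) {
            boolean startNew = current == null
                    || b.headingLevel > 0
                    || current.length() + 1 + b.text.length() > max;
            if (startNew) {
                current = new StringBuilder(b.text);
                raw.add(current);
            } else {
                current.append('\n').append(b.text);
            }
        }

        // Pass 2: join a small chunk onto the previous one when both are
        // below min and the result still fits within max.
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : raw) {
            String chunk = sb.toString();
            int last = out.size() - 1;
            if (last >= 0
                    && out.get(last).length() < min
                    && chunk.length() < min
                    && out.get(last).length() + 1 + chunk.length() <= max) {
                out.set(last, out.get(last) + "\n" + chunk);
            } else {
                out.add(chunk);
            }
        }
        return out;
    }
}
```

Note that a single block longer than `max` is kept whole here rather than split mid-paragraph; a fuller version would need a fallback for that case.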

Happy to explain more! But sadly lacking time right now to do much on that

Nick

[PR] Bump commons-io:commons-io from 2.16.0 to 2.16.1 [tika]

2024-04-08 Thread via GitHub



dependabot[bot] opened a new pull request, #1716:
URL: https://github.com/apache/tika/pull/1716

   Bumps commons-io:commons-io from 2.16.0 to 2.16.1.
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=commons-io:commons-io&package-manager=maven&previous-version=2.16.0&new-version=2.16.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   



[PR] Bump org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0 [tika]

2024-04-08 Thread via GitHub



dependabot[bot] opened a new pull request, #1714:
URL: https://github.com/apache/tika/pull/1714

   Bumps org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0.
   
   

[PR] Bump aws.version from 1.12.696 to 1.12.697 [tika]

2024-04-08 Thread via GitHub



dependabot[bot] opened a new pull request, #1715:
URL: https://github.com/apache/tika/pull/1715

   Bumps `aws.version` from 1.12.696 to 1.12.697.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.696 to 1.12.697
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-s3's changelog
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.697 2024-04-08
   
   AWS Control Catalog
   Features
   - This is the initial SDK release for AWS Control Catalog, a central 
catalog for AWS managed controls. This release includes 3 new APIs - 
ListDomains, ListObjectives, and ListCommonControls - that vend high-level data 
to categorize controls across the AWS platform.
   
   Amazon CloudWatch Network Monitor
   Features
   - Updated the allowed monitorName length for CloudWatch Network 
Monitor.
   
   Application Migration Service
   Features
   - Added USE_SOURCE as default option to LaunchConfigurationTemplate 
bootMode parameter.
   
   Commits
   - e33773e AWS SDK for Java 1.12.697
     (https://github.com/aws/aws-sdk-java/commit/e33773e8a43986ff93e2afe1f2c8193f3768bbd4)
   - 679abae Update GitHub version number to 1.12.697-SNAPSHOT
     (https://github.com/aws/aws-sdk-java/commit/679abaebd371b09e887afaa5386dc182be4c6498)
   - See full diff in compare view
     (https://github.com/aws/aws-sdk-java/compare/1.12.696...1.12.697)
   
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.696 to 1.12.697
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Bump aws.version from 1.12.696 to 1.12.697 [tika]

2024-04-08 Thread via GitHub



THausherr merged PR #1715:
URL: https://github.com/apache/tika/pull/1715



Re: [PR] Bump commons-io:commons-io from 2.16.0 to 2.16.1 [tika]

2024-04-08 Thread via GitHub



THausherr merged PR #1716:
URL: https://github.com/apache/tika/pull/1716



Re: [PR] Bump org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0 [tika]

2024-04-08 Thread via GitHub



THausherr merged PR #1714:
URL: https://github.com/apache/tika/pull/1714


