Re: [PR] Bump org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0 [tika]

2024-04-08 Thread via GitHub
THausherr merged PR #1714: URL: https://github.com/apache/tika/pull/1714 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

Re: [PR] Bump commons-io:commons-io from 2.16.0 to 2.16.1 [tika]

2024-04-08 Thread via GitHub
THausherr merged PR #1716: URL: https://github.com/apache/tika/pull/1716

Re: [PR] Bump aws.version from 1.12.696 to 1.12.697 [tika]

2024-04-08 Thread via GitHub
THausherr merged PR #1715: URL: https://github.com/apache/tika/pull/1715

[PR] Bump aws.version from 1.12.696 to 1.12.697 [tika]

2024-04-08 Thread via GitHub
dependabot[bot] opened a new pull request, #1715: URL: https://github.com/apache/tika/pull/1715 Bumps `aws.version` from 1.12.696 to 1.12.697. Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.696 to 1.12.697 Changelog Sourced from

[PR] Bump org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0 [tika]

2024-04-08 Thread via GitHub
dependabot[bot] opened a new pull request, #1714: URL: https://github.com/apache/tika/pull/1714 Bumps org.apache.jackrabbit:oak-jackrabbit-api from 1.60.0 to 1.62.0. [![Dependabot compatibility

[PR] Bump commons-io:commons-io from 2.16.0 to 2.16.1 [tika]

2024-04-08 Thread via GitHub
dependabot[bot] opened a new pull request, #1716: URL: https://github.com/apache/tika/pull/1716 Bumps commons-io:commons-io from 2.16.0 to 2.16.1. [![Dependabot compatibility

Re: Document chunking

2024-04-08 Thread Nick Burch
On Mon, 8 Apr 2024, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated

[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-08 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835077#comment-17835077 ] Lewis John McGibbney commented on TIKA-4232: It turns out that the original GitHub action I

Re: [PR] Support for adding custom tika configuration [tika-helm]

2024-04-08 Thread via GitHub
lewismc commented on PR #15: URL: https://github.com/apache/tika-helm/pull/15#issuecomment-2043768368 Thank you @ahilmathew really nice patch.

Re: [PR] Support for adding custom tika configuration [tika-helm]

2024-04-08 Thread via GitHub
lewismc merged PR #15: URL: https://github.com/apache/tika-helm/pull/15

Re: Document chunking

2024-04-08 Thread Nicholas DiPiazza
I am also very interested in this vector-based search. Indexes are a big thing right now. On Mon, Apr 8, 2024, 4:16 PM Michael Wechner wrote: > It would be great to have good "semantic chunking" in order to generate > vector embeddings. > > Thanks for the link below, will try to test it. > >

Re: Document chunking

2024-04-08 Thread Michael Wechner
It would be great to have good "semantic chunking" in order to generate vector embeddings. Thanks for the link below, will try to test it. Thanks Michael On 08.04.24 at 18:29, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking

Document chunking

2024-04-08 Thread Tim Allison
Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated yet: https://github.com/Filimoa/open-parse

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
From October 2023: https://www.brilworks.com/blog/java-11-countdown-to-end-of-support/ Getting 3.x out has taken longer than I had anticipated. Should we reopen the 17 vs 11 discussion given Eric's input? Or do we continue with the plan to target 11 in 3.x for the foreseeable future? On Mon, Apr

Re: [PR] Tika 4237 add jwt authentication ability to the http fetcher [tika]

2024-04-08 Thread via GitHub
bartek commented on code in PR #1712: URL: https://github.com/apache/tika/pull/1712#discussion_r1555919713 ## tika-pipes/tika-fetchers/tika-fetcher-http/src/main/java/org/apache/tika/pipes/fetcher/http/jwt/JwtGenerator.java: ## @@ -0,0 +1,64 @@ +package

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Eric Pugh
Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, OpenNLP is already there… Java 11 is EOL and has been for a while… Any other file parsers that are being optimized to take advantage of the newer features that are in recent Java versions that we know about? > On

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
Sorry, more correctly: OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0 requires Java 17 and our 3.x is still on 11. On Mon, Apr 8, 2024 at 6:30 AM Tim Allison wrote: > > All, > As Brian pointed out, optimaize is no longer maintained, and it has > some dependencies that have

Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
All, As Brian pointed out, optimaize is no longer maintained, and it has some dependencies that have aged out. Should we replace our baseline langdetect in tika-app and tika-server in 3.x? I'd say that we should go with our OpenNLP based language detection, but that, too, is effectively EOL'd

Tika 3.0.0-BETA2?

2024-04-08 Thread Tim Allison
All, I'm now thinking it would make sense to have one more 3.x beta release before the final 3.0.0. Are there any breaking changes that we want to get into 3.x? I'd like to wait for COMPRESS-675 to be fixed and for COMPRESS-674 to be released before we release 3.0.0-BETA2. Any other items that

Re: [PR] Bump jakarta.annotation:jakarta.annotation-api from 3.0.0-M1 to 3.0.0 [tika]

2024-04-08 Thread via GitHub
THausherr merged PR #1713: URL: https://github.com/apache/tika/pull/1713