Re: XPS files not emitting whitespace

2024-09-25 Thread Tim Allison
Thank you for raising this issue. Please re-request a jira account, and we'll accept it. Sorry about that. On Wed, Sep 25, 2024 at 11:06 AM Ruairidh Williamson < ruairidh.william...@nextdlp.com> wrote: > Hello, > > We are using tika to extract text from XPS files and have hit an issue > where whi

Re: Filter within zip file

2024-09-14 Thread Tim Allison
ve to integrate it. :D On Sat, Sep 14, 2024 at 6:58 AM Tim Allison wrote: > Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J > -t myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the > grpc server will effectively give that output. > > On

Re: Filter within zip file

2024-09-14 Thread Tim Allison
Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J -t myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the grpc server will effectively give that output. On Fri, Sep 13, 2024 at 10:01 AM David Pilato wrote: > Hey team, > > > I'm wondering if there is a wa
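
For the tika-server route mentioned above, a minimal sketch of a client for the /rmeta endpoint might look like the following. It uses only the JDK's built-in HTTP client. The base URL http://localhost:9998 and the class name RmetaRequestSketch are assumptions for illustration, not something stated in the thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: sends a container file (e.g. a zip) to tika-server's /rmeta endpoint.
class RmetaRequestSketch {

    // Builds, but does not send, a PUT request carrying the raw file bytes.
    static HttpRequest build(String baseUrl, byte[] body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the zip file; the server address is an assumed local default.
        byte[] zip = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = build("http://localhost:9998", zip);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // The body is a JSON array of metadata objects, one per container or embedded file.
        System.out.println(response.body());
    }
}
```

Filtering entries inside the zip would then be a matter of walking that JSON array on the client side.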

Re: Does Tika support detecting labels in an image?

2024-07-30 Thread Tim Allison
I agree with Tilman. If there's a more modern package/model you'd want to use whether server based or commandline, it is fairly straightforward to add a new parser to handle your needs. On Fri, Jul 26, 2024 at 11:50 PM Tilman Hausherr wrote: > I don't think so, the closest we have is DL4J but I

[ANNOUNCE] Apache Tika 3.0.0-BETA2 released

2024-07-16 Thread Tim Allison
0.0. -- Tim Allison, on behalf of the Apache Tika community

Re: [RESULT][VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-15 Thread Tim Allison
I released the artifacts and built the docker images. I'll work on the site and announcement tomorrow. On Mon, Jul 15, 2024 at 1:50 PM Tim Allison wrote: > > The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s. > > +1s (binding) > Tim Allison > Nicholas DiP

[RESULT][VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-15 Thread Tim Allison
The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s. +1s (binding) Tim Allison Nicholas DiPiazza Tilman Hausherr +1s (non-binding) Kiran Bachu Gary Gregory I'll release the artifacts shortly and update the website. Thank you, all! Best, Tim On Fri, Jul 12, 2024

[VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-12 Thread Tim Allison
A candidate for the Tika 3.0.0-BETA2 release is available at: https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA2 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/3.0.0-BETA2-rc1/ The SHA-512 checksum of the archive is 8a4142f61110f196c550146637994
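
Voters typically verify the published SHA-512 checksum before casting a +1. A minimal sketch of that check, using only the JDK, could look like this. The class name Sha512Check and the command-line arguments are hypothetical, and the expected checksum must be copied from the full vote email.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the SHA-512 digest of a stream as lowercase hex.
class Sha512Check {

    static String sha512Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            byte[] buffer = new byte[8192];
            int read;
            // Stream the content so large archives are not loaded into memory at once.
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: downloaded source archive; args[1]: checksum from the vote email.
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            String actual = sha512Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum matches" : "checksum MISMATCH: " + actual);
        }
    }
}
```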

Re: Correlating the output of rmeta and unpack

2024-06-17 Thread Tim Allison
I regret that those endpoints do not have a reliable way to link them. I recently integrated something that does work, but it requires the tika-pipes framework, which you can use via tika-server. It will output .json files and a subdirectory of binary files, and there is a key in the json file th

Re: Issues installing standalone server on Ubuntu 22.04

2024-06-11 Thread Tim Allison
I regret that we haven't had contributions on the tika as a service scripts since 1.x. We could really use help. On Tue, Jun 11, 2024 at 3:37 AM JB Data31 wrote: > > No real explanation of this problem, but indeed the service is not really > installed. > > > *$ service --status-all | grep tika$*

Re: Script tag contents not always reported in ContentHandler

2024-05-30 Thread Tim Allison
Markus, I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that 3.x isn't out yet, but I wanted to give you a heads up. To extract scripts in 3.x, you'd do something like this: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-mod

Re: Setting limits on text extraction for compressed files with Tika Server

2024-05-24 Thread Tim Allison
I'm not sure which endpoint you're using, but search for "writeLimit" on this page: https://cwiki.apache.org/confluence/display/TIKA/TikaServer As you probably know, many file formats are actually compressed: PDF, docx, etc. There is no way to know ahead of time for many file formats what the amou

multi-arch support for tika-docker!

2024-05-21 Thread Tim Allison
All, Many thanks to the many community members who helped figure this out and get it out the door! As of tika-docker 2.9.2.1, we now have multi-arch support (and on noble!). Let us know if there are any surprises. Thank you, again! Cheers, Tim Ref: https://hub.docker.com/r

Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-29 Thread Tim Allison
during the parse in the above section. On Mon, Apr 29, 2024 at 10:28 AM Tim Allison wrote: > I agree with Nick. > > You can better understand the magic based algorithms we're using for > detection by searching for mp4 and quicktime in this file: > https://github.com/apache

Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-29 Thread Tim Allison
I agree with Nick. You can better understand the magic based algorithms we're using for detection by searching for mp4 and quicktime in this file: https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml A middle ground is to have the MP4 parse

Re: Extracting XML comments

2024-04-18 Thread Tim Allison
Hi Claude, I'd recommend a custom XMLParser for this, perhaps subclass DcXMLParser? We could also parameterize this in the DcXMLParser if a committer had a chance to add that feature or review a PR from you. Best, Tim On Thu, Apr 18, 2024 at 7:33 AM Claude Warren wrote: > It seems th

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-04 Thread Tim Allison
I'm on ubuntu. That's the 2.7.0 pom, obv. I just bumped the versions, reloaded and ran to see different numbers of parsers in 2.7.0 vs 2.8.0+2.9.0. On Thu, Apr 4, 2024 at 8:20 AM Tim Allison wrote: > > I'm attaching the pom. I can't remember if attachments get strippe

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-04 Thread Tim Allison
pdf.PDFParser.PASSWORD); on both > 2.7.0 and 2.8.0+ > > Thanks, and regards, > Gerardo > > From: Tim Allison > Sent: Wednesday, April 3, 2024 06:43 AM > To: user@tika.apache.org > Subject: Re: AutoDetectParser not working after upgrading f

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-03 Thread Tim Allison
Y, I'm not able to repro this problem with 2.8.0 or higher. I'm seeing 239 parsers (probably diff from Tilman because of installed external parsers?). On Wed, Apr 3, 2024 at 5:09 AM Tilman Hausherr wrote: > > On 03.04.2024 08:55, Gerardo Hernandez wrote: > > On 2.7.0, I get a list of 203 parsers,

Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
I also released our docker images for 2.9.2.0. How do we update helm? On Tue, Apr 2, 2024 at 2:31 PM Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > Tika 2.9.2. The release contents have been pushed out to the main > Apache release sit

[ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-04-02 Thread Tim Allison
The vote has passed with 3 PMC +1s and no -1s. +1s Oleg Tikhonov Tilman Hausherr Tim Allison I'll release the artifacts shortly and update the website. Thank you, all! Best, Tim On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov wrote: > +1, > Thanks. > > On Mon, 1

Re: 2.9.1 release?

2024-04-01 Thread Tim Allison
n, 16 Oct 2023 at 11:46, Tom Conlon wrote: > >> Hi, >> Would it be possible for the issue "Fix tika as a service" >> https://issues.apache.org/jira/browse/TIKA-4152 >> to be reviewed before release? >> >> Thanks >> Tom >> >> On M

Re: tika-helm now on artifacthub.io

2024-03-30 Thread Tim Allison
W00t! Thank you Lewis! On Sat, Mar 30, 2024 at 3:57 PM lewis john mcgibbney wrote: > Hi user@, dev@, > > For those running Tika on Kubernetes, you can now conveniently find the > Helm Chart via artifacthub.io > > https://artifacthub.io/packages/helm/apache-tika/tika > > I’ll build in a little mo

[VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-03-26 Thread Tim Allison
A candidate for the Tika 2.9.2 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.2 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.2-rc2/ The SHA-512 checksum of the archive is 5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e

Re: Meta output format of tika server /unpack/all

2024-03-21 Thread Tim Allison
embedded raw bytes and the rmeta content. Not sure what to call that endpoint. Recommendations? On Thu, Mar 21, 2024 at 6:10 PM Tim Allison wrote: > If rmeta/text is not returning text extracted from embedded files that’s a > bug. > > I don’t think /rmeta/all is a thing. > > On Th

Re: Meta output format of tika server /unpack/all

2024-03-21 Thread Tim Allison
If rmeta/text is not returning text extracted from embedded files that’s a bug. I don’t think /rmeta/all is a thing. On Thu, Mar 21, 2024 at 5:21 PM Zig Zag wrote: > Thanks Josh, that's correct but rmeta/text allows you to control this but > it only returns one level of text (not documents embed

Re: About Tika 2.9.2 release date

2024-03-21 Thread Tim Allison
Doh! 'Tis the season. PDFBox has started their release cycle. Let's wait for PDFBox 2.0.31. On Thu, Mar 21, 2024 at 11:54 AM Tim Allison wrote: > All, > > I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for > an rc1. Again, let me know if ther

Re: About Tika 2.9.2 release date

2024-03-21 Thread Tim Allison
All, I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for an rc1. Again, let me know if there are any blockers or other things we need to get into 2.9.2. Thank you! Best, Tim On Wed, Mar 20, 2024 at 2:00 PM Tim Allison wrote: > Fellow devs and community,

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
efault" tika configs just to get the version number in tika-server. That issue fixes that problem. This fix may improve the loading speed of tika-server, too. :D On Wed, Mar 20, 2024 at 3:54 PM Tim Allison wrote: > Looking at TikaConfig, it looks like the "excluded" parsers are

Re: Tika-parser not able to parse specific content

2024-03-21 Thread Tim Allison
If you know that you're only parsing text files, you could configure only the TextAndCSVParser and specify that it processes "application/octet-stream". This should force every file to be processed by that parser. Something like this? application/octet-stream Or you could tell tik

Re: Memory exception

2024-03-21 Thread Tim Allison
Hi Chetan, Need more info... An eml file contains 465MB of XML and an MP4? How big is the mp4? Are you getting the same behavior with {{java -jar tika-app.jar -J -t big.eml}}? From the stacktrace, can you tell where the final straw is in memory allocation? Are you able to share the file with me of

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
We should also fix this: https://issues.apache.org/jira/browse/TIKA-4216 On Thu, Mar 21, 2024 at 9:13 AM Tim Allison wrote: > This is one problem: https://issues.apache.org/jira/browse/TIKA-4215 > > On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard > wrote: > >> Hi all, >

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
This is one problem: https://issues.apache.org/jira/browse/TIKA-4215 On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard wrote: > Hi all, > > I've got Tika 2.9.1 server running on Linux and Tika is checking for the > presence of ImageMagick. I tried disabling the TesseractOCR parser in my > xml config

Re: Running tika-server and I need this check to NOT happen

2024-03-20 Thread Tim Allison
Looking at TikaConfig, it looks like the "excluded" parsers are actually loaded and initialized, but they are not added to the composite parser if they're on the exclude list. We should try to avoid loading them at all if they are excluded. IIRC, this is a bit complex in TikaConfig. Let me take a

Re: Tika-parser not able to parse specific content

2024-03-20 Thread Tim Allison
I'm wondering if we can tighten the detection to include a newline after the P2, etc. It looks like we require a new line for some of those file format variants. Let me do some research, unless anyone happens to know. On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan wrote: > Hi, > I tried configurin

Re: About Tika 2.9.2 release date

2024-03-20 Thread Tim Allison
Fellow devs and community, I'd like to fix TIKA-4211 before the next release. It has been a while since our last 2.x release. What do you think about aiming for starting the voting process early next week? Any other blockers? On Tue, Mar 19, 2024 at 7:49 PM Shu Peng wrote: > Dear Tika Team, >

Re: Questions on rmeta and pipes

2024-03-12 Thread Tim Allison
ank you!. Will definitely provide feedback. > > While this get into 3.0 officially is there something I can prototype with > /rmeta to help me get my other stuff working - any suggestions on approach > or a draft PR for the official feature would be very helpful > > On Tue, Mar 12

Re: Questions on rmeta and pipes

2024-03-12 Thread Tim Allison
Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207 I think I'll be wiring this into the /pipes and /async endpoints. The json request will specify that you want bytes AND text+metadata. There will be two options: a) you specify two emitters: one for json and one for raw byte

Re: Replacing full tika-app.jar to directly using tika-core / and parsers

2024-03-08 Thread Tim Allison
Hi Brian, A few thoughts: 1) tika-app is basically tika-core + tika-parsers-standard-package. Which components are you trying to avoid? tika-serialization and jackson? boilerpipecontenthandler and some of its dependencies? I ask, because we could factor out a tika-app-core with no parsers in Tik

Re: Question about error when upgrading to 2.9.1 from 2.2.0

2024-02-08 Thread Tim Allison
How are you managing dependencies? Are you bringing in tika-app? Are you using tika-parsers-standard-package? I agree with Tilman that there's likely an older version of pdfbox on your classpath. On Thu, Feb 8, 2024 at 10:22 AM Tilman Hausherr wrote: > No, that one is fine. Could it be that a l

Re: Replacing tika server default parser with my custom default parser

2024-02-05 Thread Tim Allison
. :D Best, Tim [0] https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters#ModifyingContentWithHandlersAndMetadataFilters-4.AutoDetectParserConfig On Sun, Feb 4, 2024 at 2:27 PM Tim Allison wrote: > W00t! Let us know when you have m

Re: Replacing tika server default parser with my custom default parser

2024-02-04 Thread Tim Allison
W00t! Let us know when you have more questions! On Fri, Feb 2, 2024 at 10:27 AM Slava G wrote: > Thanks a lot !! > > On Thu, Feb 1, 2024 at 6:36 PM Tim Allison wrote: > >> https://tika.apache.org/3.0.0-BETA/parser_guide.html >> >> You should be able to add your

Re: Replacing tika server default parser with my custom default parser

2024-02-01 Thread Tim Allison
https://tika.apache.org/3.0.0-BETA/parser_guide.html You should be able to add your parser in a services file, and the way the class loading sorting works, non-tika parsers should have a higher priority automatically. If that doesn't work, we can update the documentation to show what that would lo

3.0.0 release and deprecation planning for 2.x

2024-01-31 Thread Tim Allison
All, I'd like to run a final html eval comparing Tika 2.x (tagsoup) and 3.x (jsoup) on TIKA-4185. Other than that, are there any blockers or things that we need to get into 3.0.0 before we make the first release? If there aren't any blockers, I can aim for a 3.0.0-rc1 probably towards the beginni

Re: Parser removes file content and treats it as Metadata

2024-01-25 Thread Tim Allison
Content-Type":"message/rfc822"}] On Thu, Jan 25, 2024 at 2:11 PM Gerardo Hernandez wrote: > Hi Ken, > > Unfortunately enforcing Tika to use TXTParser does not solve our problem > at all, I mean it would work for very simple emails, but we also want to be > able to par

Re: Issue with HttpFetcher in Tika - URISyntaxException due to Unencoded Characters in URL

2024-01-04 Thread Tim Allison
That's not good. Thank you for sharing this with us: https://issues.apache.org/jira/browse/TIKA-4178 On Fri, Dec 22, 2023 at 11:18 AM João Domingues wrote: > Dear Tika Team, > > I am writing to report an issue encountered while using Apache Tika's > HttpFetcher functionality, specifically when
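
Until the issue is resolved, one possible client-side workaround is to encode the URL before handing it to the fetcher. The sketch below relies only on java.net.URI, whose multi-argument constructors quote illegal characters. The class name UrlEncodingSketch and the example host and path are hypothetical, and this is not necessarily how TIKA-4178 was or will be fixed.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: percent-encodes illegal characters in a URL path.
class UrlEncodingSketch {

    static String encode(String scheme, String authority, String path, String query) {
        try {
            // The multi-argument constructor quotes characters that are illegal in each component.
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build URI", e);
        }
    }

    // Returns true if the raw string is rejected by the single-argument parser.
    static boolean rejectedAsIs(String rawUrl) {
        try {
            URI.create(rawUrl);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectedAsIs("https://example.com/docs/my file.pdf"));
        System.out.println(encode("https", "example.com", "/docs/my file.pdf", null));
    }
}
```

One caveat: these constructors always quote the percent character, so input that is already percent-encoded would be encoded a second time.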

Re: HTML Parsing Changes in 3.0.0-BETA

2024-01-02 Thread Tim Allison
Please, please let us know of any other problems you find! And, thank you, again. On Tue, Jan 2, 2024 at 8:58 AM Tim Allison wrote: > > yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522 > > Once the checks pass, I'll merge that, and we should be go

Re: HTML Parsing Changes in 3.0.0-BETA

2024-01-02 Thread Tim Allison
yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522 Once the checks pass, I'll merge that, and we should be good to go. Thank you so much for letting us know of this bug. On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold wrote: > > Hi, > > I'm currently testing the upgrade

Re: [ANNOUNCE] Apache Tika 3.0.0-BETA released

2023-12-15 Thread Tim Allison
-parser-image-module and tika-parser-pdf-module? > > Cheers, > Stephen. > > On 13/12/2023 14:40, Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > > Tika 3.0.0-BETA. The release contents have been pushed out to the main > > Apa

[ANNOUNCE] Apache Tika 3.0.0-BETA released

2023-12-13 Thread Tim Allison
ted CVEs: CVE-2023-6481/CVE-2023-6378. NOTE: This release requires Java 11. We plan to support the 2.x branch (which requires Java 8) for six months after the release of 3.0.0. -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-13 Thread Tim Allison
The vote has passed with three PMC +1s and no -1s. +1s Konstantin Gribov Tilman Hausherr Tim Allison Thank you all. I'll try to push the artifacts and update the website shortly. Best, Tim On Mon, Dec 11, 2023 at 3:45 PM Tim Allison wrote: > Thank you, Konstantin! > >

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-11 Thread Tim Allison
nt-type=maven&component-name=ch.qos.logback%2Flogback-core&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1 > [2]: https://logback.qos.ch/news.html#1.3.14 > [3]: https://logback.qos.ch/manual/receivers.html > > -- > Best regards, > Konstantin Gribov.

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-11 Thread Tim Allison
All, We have two +1s. We need another +1 for the release. If a fellow dev has the time to vote, please do! Thank you. Best, Tim On Wed, Dec 6, 2023 at 3:17 PM Tim Allison wrote: > Oops, I forgot to include my +1 for this RC1 for 3.0.0-BETA. Would another > fellow dev be will

Re: tika-server too many open files

2023-12-07 Thread Tim Allison
What version of Tika? Are you running it in Docker or uncontained? On Wed, Dec 6, 2023 at 12:31 PM Mark Kerzner SHMsoft, Inc. < mark.kerz...@shmsoft.com> wrote: > Hi, > > I get this error: > > /tmp/apache-tika-server-forked-tmp-13502607376096852844: Too many open > files > > Can you please help?

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-06 Thread Tim Allison
man > > > > On 01.12.2023 18:25, Tim Allison wrote: > > A candidate for the Tika 3.0.0-BETA release is available at: > > https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA > > > > The release candidate is a zip archive of the sources in: > > https://gi

[VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-01 Thread Tim Allison
A candidate for the Tika 3.0.0-BETA release is available at: https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/3.0.0-BETA-rc1/ The SHA-512 checksum of the archive is 6a98e19f73e0ccf9c902cf869fb50c0c

Re: [EXTERNAL] Re: Tika parser not parsing email content

2023-11-06 Thread Tim Allison
/browse/TIKA-4153. > How do you think I can proceed with the parsing of the document, is there > a latest version I can download? Where exactly I can find this version to > download? > > Thanks, > Kashif > > > On Wed, Oct 11, 2023 at 1:46 AM Tim Allison wrote: > >

Re: How to process metadata returned by the tika server?

2023-11-03 Thread Tim Allison
A heavier-weight option is to use the tika-serialization module (which uses Jackson databind) and do something like this: https://github.com/apache/tika/blob/main/tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java#L89 On Fri, Nov 3, 2

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
Request limits can fairly easily be implemented outside of Tika, but > resource isolation is not, so having a solution for that as well would be > very nice. > Isolation as in pipes? One file per forked jvm at a time? > This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram) >> >>

Virtual Apache Tika meetup in celebration of World Digital Preservation Day

2023-10-26 Thread Tim Allison
I'm throwing an intro to Tika hands-on workshop on Nov 2 to celebrate World Digital Preservation Day: https://www.meetup.com/apache-tika-community/events/296969821/ Please let me know if anyone on the list would like to organize or speak at a Tika-based meetup going forward. Cheers, Tim

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
>return 503 if > requests Oops, 429, I'd guess? On Thu, Oct 26, 2023 at 9:33 AM Tim Allison wrote: > > > The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am > > investigating if they are enforced fast enough before the system OOM kicks > >
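
As a rough illustration of the kind of request limit discussed in this thread, the sketch below caps concurrent requests with a semaphore and answers 429 when the cap is reached. The class name RequestGate and the idea of wrapping a handler that returns an HTTP status are assumptions for illustration. This is not a tika-server feature, and it would live in a proxy or wrapper in front of Tika.

```java
import java.util.concurrent.Semaphore;
import java.util.function.IntSupplier;

// Hypothetical gate: rejects work with HTTP 429 when too many requests are in flight.
class RequestGate {
    private final Semaphore permits;

    RequestGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the handler and returns its status code, or 429 if no permit is free.
    int handle(IntSupplier handler) {
        if (!permits.tryAcquire()) {
            return 429; // Too Many Requests: the caller should back off and retry.
        }
        try {
            return handler.getAsInt();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        RequestGate gate = new RequestGate(1);
        // The nested call arrives while the outer one still holds the only permit.
        int outer = gate.handle(() -> {
            System.out.println("nested request: " + gate.handle(() -> 200));
            return 200;
        });
        System.out.println("outer request: " + outer);
    }
}
```

A limit like this bounds the number of concurrent parses but does not isolate memory between them, which is the separate concern raised in the thread.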

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
> The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am > investigating if they are enforced fast enough before the system OOM kicks > in. So far I would say that is not the case. This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram) suggests that maxRAM is
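
To see what heap cap a given flag actually produces inside a container, a small probe like the one below can help. It is only a sketch: the class name HeapLimitReport is hypothetical, and the figure it prints covers the Java heap only, not native or off-heap memory, which is the part the kernel OOM killer may still react to.

```java
// Hypothetical probe: prints the maximum heap the running JVM will attempt to use.
class HeapLimitReport {

    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("Available processors: " + Runtime.getRuntime().availableProcessors());
    }
}
```

For example, running `java -XX:MaxRAMPercentage=50.0 HeapLimitReport.java` inside the container and comparing the result with the cgroup memory limit shows how much headroom is left for non-heap allocations.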

Re: Tika OOM issue

2023-10-24 Thread Tim Allison
Sorry for my delay. > My preliminary conclusion is that the jvm is not able to enforce these flags > 100% of the time quickly enough before the cgroup limits kick in and the > kernel oom kicks in. Did anyone else experience this? Y, that's my guess as well. There's a chance that some parsers

Re: [ANNOUNCE] Apache Tika 2.9.1 released

2023-10-20 Thread Tim Allison
I released the docker image, just now, too. Lewis or anyone else who has knowledge + time to release the helm chart, go forth! On Fri, Oct 20, 2023 at 2:38 PM Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > Tika 2.9.1. The release contents

[ANNOUNCE] Apache Tika 2.9.1 released

2023-10-20 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

Re: Release timing for 2.9.1 and 3.0.0-beta?

2023-10-20 Thread Tim Allison
Thank you, Keith. On Tue, Oct 17, 2023 at 11:17 PM Keith Bennett wrote: > Hi, Tim. You have given so, so much to this project. As far as I'm > concerned, you *never* need to say you're sorry. ;) > > - Keith > > > On Oct 17, 2023, at 2:59 AM, Tim Allison wrote:

[RESULT][VOTE] Release Apache Tika 2.9.1 Candidate #1

2023-10-20 Thread Tim Allison
The vote has passed with 3 +1s and no -1s. +1s Tim Allison Tilman Hausherr Oleg Tikhonov Thank you, all! I'll release the artifacts and update the website shortly. On Thu, Oct 19, 2023 at 6:52 AM Tim Allison wrote: > My belated +1. Reports are here: > https://corpora.tika.apac

[VOTE] Release Apache Tika 2.9.1 Candidate #1

2023-10-17 Thread Tim Allison
A candidate for the Tika 2.9.1 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.1 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.1-rc1 The SHA-512 checksum of the archive is ba13a0d22994ca84cccd9ad2931e099051870d46a5a34402

Release timing for 2.9.1 and 3.0.0-beta?

2023-10-16 Thread Tim Allison
All, We detected and fixed an area for improvement in the version of POI that we just upgraded to (https://bz.apache.org/bugzilla/show_bug.cgi?id=67767). I should have caught this in earlier regression tests before the release of POI, but I clearly botched that comparison run. I'm sorry. My g

Re: 2.9.1 release?

2023-10-16 Thread Tim Allison
at, Oct 14, 2023 at 7:16 AM Tim Allison wrote: > Looks like we have a bunch of new > "org.apache.poi.util.RecordFormatException: Tried to allocate an array of > length 10,xxx,xxx, but the maximum length for this record type is >

Re: 2.9.1 release?

2023-10-16 Thread Tim Allison
hat value. Also "error" is now nothing. > > Tilman > > On 14.10.2023 13:16, Tim Allison wrote: > > Looks like we have a bunch of new > "org.apache.poi.util.RecordFormatException: Tried to allocate an array of &g

Re: 2.9.1 release?

2023-10-14 Thread Tim Allison
the regression tests didn't pick this up. The changes in rfc822 detection have also had some effects. The few handfuls that I've reviewed are actually positive changes. I'll review systematically on Monday. On Sat, Oct 14, 2023 at 6:35 AM Tim Allison wrote

Re: 2.9.1 release?

2023-10-14 Thread Tim Allison
Reports are here: https://corpora.tika.apache.org/base/reports/tika-2.9.1-reports.tgz I haven't had a chance to look at them yet. :( Will take a look early Monday (ET). On Wed, Oct 11, 2023 at 10:24 AM Tim Allison wrote: > Unless there are objections, I'll kick off the 2.9.1 reg

Re: Converting Tika 1.x (as library) to Tika 2.x (as server) with custom classes

2023-10-13 Thread Tim Allison
The custom parsers and contenthandler can be configured via Tika-config. We don’t yet have a way to configure AbstractRecursive… or the DocumentSelector. Note that 3.x beta should be out soon. Aside from requiring Java 11, there aren’t big changes in 3.x. I’ll dig up examples when I’m back to a

Re: [EXTERNAL] 2.9.1 release?

2023-10-11 Thread Tim Allison
; > > > - Original message - > From: "Tim Allison" > To: "Tika User" , "" < > d...@tika.apache.org> > Cc: > Subject: [EXTERNAL] 2.9.1 release? > Date: Wed, Oct 11, 2023 10:25 AM > > Unless there are objections, I'll k

2.9.1 release?

2023-10-11 Thread Tim Allison
Unless there are objections, I'll kick off the 2.9.1 regression tests shortly. I just cherry-picked TIKA-4153 into 2.x...will be interesting to see how that works. Best, Tim On Tue, Oct 10, 2023 at 1:37 PM Tim Allison wrote: > All, > Nandita's email didn'

Re: [EXTERNAL] Re: Tika parser not parsing email content

2023-10-10 Thread Tim Allison
> this loss of fidelity when we index their file attachments. Is there a Jira > item where I can read about the reason behind its current implementation? > -Josh/HCL > > > > > From:"Tim Allison" > To:user@tika.apache.org > Date:10/10

Requesting Tika Server release: commons-compress vulnerability

2023-10-10 Thread Tim Allison
ies/GHSA-cgwf-w82q-5jrr> This is due to use of Tika Server 2.9.0 (Apache Tika – Apache Tika 1.27 <https://tika.apache.org/2.9.0/index.html>), which has commons-compress as a dependency. I saw that Tim Allison recently updated this *commons-compress* version in the GitHub mirror repo

Re: Tika parser not parsing email content

2023-10-10 Thread Tim Allison
I can confirm this is still happening in our main/3.x branch. As you probably guessed, the issue is that the file is identified as an email and then parsed as if it were one. If you know that all you have are plain text files, you might consider using the TextAndCSVParser or just the TXTParser. O

Re: Error after install: "Failed to start LSB: Controls Apache Tika as a Service"

2023-10-06 Thread Tim Allison
I opened: https://issues.apache.org/jira/browse/TIKA-4152 to track this. On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon wrote: > Forgot to add that on the same machine Solr installs and runs fine > Started LSB: Controls Apache Solr as a Service. > > plus have installed tika ok in the past but the las

Re: Error after install: "Failed to start LSB: Controls Apache Tika as a Service"

2023-10-06 Thread Tim Allison
I regret I haven't had a chance to look at this. We got a similar email a month ago: https://lists.apache.org/thread/mnf3pxlmvdy456v4s2b8r7mv3khl3msk Which versions of Tika last worked for you? Was it a 2.x, or did we break something in the 2.x branch? On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon

Re: Tika server mode - /async task tracking

2023-10-05 Thread Tim Allison
Will definitely explore > the Pipes reporter and the unit tests, if I have something worthy > documenting in the end I will give you a shout on here. > > Many thanks, > Georgi > > On Tue, 3 Oct 2023, 20:38 Tim Allison, wrote: > >> I'm sorry for my delay. >> >

Re: Tika server mode - /async task tracking

2023-10-03 Thread Tim Allison
I'm sorry for my delay. At some point, I was thinking about implementing: /async/ but I gave up. The problem was that I didn't want to have to tie caching/storing status info into tika-server or the async processor -- so I created a configurable PipesReporter class...see below. If you set up log

Re: [External] Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-14 Thread Tim Allison
Y, totally get it. How about shortening the EOL of Tika 2.x (and Java 8) to 6 months after the Tika 3.x/Java 11 release? On Thu, Sep 14, 2023 at 5:41 AM Sandeep Kulkarni wrote: > As a long time user of Tika, I would like to suggest Java 11 should be > supported for 3.x. Java 17 is still quite n

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Tim Allison
We seem to have consensus on Java 11 for 3.x and keep Java 8 for 2.x for one more year. I've started the branches and started making some changes in this direction. Is it worth pushing this modernization further or faster, with either: a) Jump to Java 17 now and keep Java 8 in 2.x for one more ye

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Tim Allison
ote: >> >> +1 from our side, we moved to java 11 last year. >> >> Best, >> Luis >> >> >> Em ter, 12 de set de 2023 19:01, Ken Krugler >> escreveu: >> >>> +1 >>> >>> On Sep 12, 2023, at 7:56 AM, Tim Allison wrote

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Tim Allison
fixes only" 6 months after the first release of 3.0.0? In 3.x, we'd require Java 11 and jakarta. We wouldn't make many other major changes. On Tue, Sep 12, 2023 at 10:49 AM Tim Allison wrote: > >If Tika users will be happy to move on and drop Java 8 and/or javax. > Ple

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Tim Allison
>If Tika users will be happy to move on and drop Java 8 and/or javax. Please drop them :))) Fellow devs and broader Tika community, are we ok with EOL'ing Tika 2.x and dropping support for Java 8 and javax in September 2024?

[ANNOUNCE] Apache Tika 2.9.0 released

2023-08-28 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Tim Allison
This vote passes with 4 binding +1s, 1 non-binding +1 and no -1s. Binding +1s: Tim Allison Konstantin Gribov Tilman Hausherr Oleg Tikhonov Non-binding +1 Julien Nioche I'll release the artifacts and update the website shortly. Thank you, all! Best, Tim On Mon, Aug 28, 2023 at 10:

Re: [VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Tim Allison
elease > > Julien > > On Wed, 23 Aug 2023 at 15:50, Tim Allison wrote: > >> A candidate for the Tika 2.9.0 release is available at: >> https://dist.apache.org/repos/dist/dev/tika/2.9.0 >> >> The release candidate is a zip archive of the sources in: >> htt

[VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-23 Thread Tim Allison
A candidate for the Tika 2.9.0 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.0 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.0-rc1/ The SHA-512 checksum of the archive is 4b54172163a2e86b805e7077b11d21902dc2137a849eb0d

Re: Parser modifying file's access time

2023-08-23 Thread Tim Allison
I think this is a Linux issue, not a Tika issue, e.g.: https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time On Mon, Aug 21, 2023 at 10:03 PM Gerardo Hernandez wrote: > Hi, > > I’d like to understand why I’m experiencing the following behavior and if > it’s expected (as I
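
One way to check whether the access-time change comes from the filesystem rather than from Tika is to read a file with plain JDK calls and compare its access time before and after. The sketch below does that. The class name AccessTimeProbe is hypothetical, and the outcome depends on the mount options in effect (for example relatime or noatime).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

// Hypothetical probe: shows whether a plain read, with no Tika involved, updates a file's atime.
class AccessTimeProbe {

    static FileTime accessTime(Path path) {
        try {
            return Files.readAttributes(path, BasicFileAttributes.class).lastAccessTime();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);
        FileTime before = accessTime(file);
        Files.readAllBytes(file); // an ordinary read, comparable to what any parser must do
        FileTime after = accessTime(file);
        System.out.println("before: " + before);
        System.out.println("after:  " + after);
        System.out.println(before.equals(after) ? "atime unchanged" : "atime updated by the read");
    }
}
```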

Re: setMaxContentLength Behavior Differs Across Parsers?

2023-08-14 Thread Tim Allison
Is it possible that this is due to extra whitespace in the PDF? On Sun, Jul 30, 2023 at 2:17 PM Keith Bennett wrote: > Hi, all. I am finally getting around to updating the "rika" Ruby gem for > interacting with Tika in JRuby, and encountered something weird. When I > test parsing a text file wit

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
Yes. Let us know if you find otherwise! On Mon, Aug 14, 2023 at 11:16 AM Keith Bennett wrote: > Tim, thank you so much for responding. Can I rely on Content-Type to > always be populated by a parse? > > - Keith > > > On Mon, Aug 14, 2023 at 10:09 PM Tim Allison wrote: >

Re: Using Tika with another OCR engine

2023-08-14 Thread Tim Allison
Concur with Nick. And, y, I'd frankly copy the TesseractOCRParser into a new module, rename it and modify it to call your OCR engine, build the jar and add the dependency to your tika bin directory (if you're using Docker?). On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir wrote: > Hi Nick, > >

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
Content-Type may be more reliable/specific because for some file types, the parser updates the file type during the parse. For example the PDF parser updates application/pdf -> application/illustrator (or similar?) if the parser determines that the file is a PDF-based Adobe Illustrator file. The

Re: Tika 2.7.0 Java version supports

2023-07-25 Thread Tim Allison
We use 17 in our docker images and one of our ci/cd pipelines uses 18. Are you having problems? On Tue, Jul 25, 2023 at 6:00 AM Slava G wrote: > Hi, > Silly question, what is the highest Java version that Tika 2.7.0 supports ? > > Thanks >

Re: Tika memory usage using watchdog

2023-07-25 Thread Tim Allison
es on the -Xmx that's configured. On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir wrote: > Hello, > > On 21 Jul 2023 at 23:51:54, Cristian Zamfir wrote: > >> Hi Tim! >> >> Sorry for the lack of details, adding now. >> >> On 21 Jul 2023 at 18:56:02, T
